├── mLongT5-2305.11129.pdf
├── Optimization
│   ├── 2302.13971.pdf
│   ├── 2305.11206.pdf
│   ├── 63bd8a1f20da862484184fdb_blog extractive .png
│   ├── The Best Medium Article Formatting Guide _ by Casey Botticello _ Blogging Guide _ Medium.pdf
│   ├── read.me
│   ├── write-follow-clap-comment-highloght-repeat.txt
│   ├── GPT4VsLima.txt
│   ├── extractive-vs-abstractive-summarization-in-healthcare.txt
│   ├── nlp-basics-abstractive-and-extractive-text-summarization.txt
│   ├── BERTexplanation.txt
│   ├── WhyOneMinuteManagers.txt
│   ├── understanding-automatic-text-summarization-2.txt
│   ├── understanding-automatic-text-summarization-1.txt
│   └── AutomaticTextSummarization.txt
├── LaMini-LM_paper_2304.14402.pdf
├── Efficient Prompting via Dynamic In-Context Learning-2305.11170.pdf
├── Evaluating Open-Domain Question Answering in the Era of Large Language Models_paper_2305.06984.pdf
├── Musketeer_All for One and One for All_paper_A Generalist Vision-Language Model with Task Explanation Prompts-2305.07019.pdf
├── README.md
├── LaMini paper.txt
├── GPT Agents and AGI.txt
├── largeModelsOrGoodData.txt
├── semanticsearcha_praticaloverview.txt
├── LLMonYourDomain.txt
└── LaMini-LM - Mini Models Maxi Data! [English] [DownloadYoutubeSubtitles.com].txt
/mLongT5-2305.11129.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/mLongT5-2305.11129.pdf
--------------------------------------------------------------------------------
/Optimization/2302.13971.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/Optimization/2302.13971.pdf
--------------------------------------------------------------------------------
/Optimization/2305.11206.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/Optimization/2305.11206.pdf
--------------------------------------------------------------------------------
/LaMini-LM_paper_2304.14402.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/LaMini-LM_paper_2304.14402.pdf
--------------------------------------------------------------------------------
/Optimization/63bd8a1f20da862484184fdb_blog extractive .png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/Optimization/63bd8a1f20da862484184fdb_blog extractive .png
--------------------------------------------------------------------------------
/Efficient Prompting via Dynamic In-Context Learning-2305.11170.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/Efficient Prompting via Dynamic In-Context Learning-2305.11170.pdf
--------------------------------------------------------------------------------
/Evaluating Open-Domain Question Answering in the Era of Large Language Models_paper_2305.06984.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/Evaluating Open-Domain Question Answering in the Era of Large Language Models_paper_2305.06984.pdf
--------------------------------------------------------------------------------
/Optimization/The Best Medium Article Formatting Guide _ by Casey Botticello _ Blogging Guide _ Medium.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/Optimization/The Best Medium Article Formatting Guide _ by Casey Botticello _ Blogging Guide _ Medium.pdf
--------------------------------------------------------------------------------
/Musketeer_All for One and One for All_paper_A Generalist Vision-Language Model with Task Explanation Prompts-2305.07019.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMiniPower/main/Musketeer_All for One and One for All_paper_A Generalist Vision-Language Model with Task Explanation Prompts-2305.07019.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # LaMiniPower
2 | The power of a small LLM for your Summarization and domain knowledge QnA
3 |
4 | Download the Google Colab Notebook
5 | Or [Click here to open it in Google Colab](http://colab.research.google.com/github/fabiomatricardi/LaMiniPower/blob/main/Study_of_Summarization_with_FlanT5LaMini.ipynb)
6 |
7 |
8 |
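As a quick start outside the notebook, a minimal sketch for loading one of the small LaMini checkpoints could look like the following (the checkpoint name and the prompt are illustrative assumptions, not code taken from the notebook):

```python
# Minimal sketch (not the notebook's code): load a small LaMini checkpoint and query it.
# The checkpoint name is an assumption based on the LaMini-LM release on the Hugging Face Hub.
from transformers import pipeline

lamini = pipeline("text2text-generation", model="MBZUAI/LaMini-Flan-T5-248M")

result = lamini("What is the difference between extractive and abstractive summarization?",
                max_length=256)
print(result[0]["generated_text"])
```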
--------------------------------------------------------------------------------
/Optimization/read.me:
--------------------------------------------------------------------------------
1 | # Content
2 | This folder contains 4 text documents taken from Medium articles,
3 | used to run summarization tests on them.
4 | ## Purpose
5 | The LangChain summarization chain has 3 styles of application:
6 | - map_reduce
7 | - stuff
8 | - refine
9 | The summarization pipeline instead uses the Transformers library: the technique here
10 | consists of chunking the text, summarizing each chunk with some overlap (so context
11 | is not lost at chunk boundaries), and joining the partial summaries into one.
12 | ## Results
13 | - Pipeline summarization is excellent for a longer summary
14 | - LangChain summarization with the refine style provides a more concise yet precise summary, similar to an abstract
15 |
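A minimal sketch of the pipeline technique described above (chunk with overlap, summarize each chunk, join the partial summaries). The checkpoint name, chunk size, and overlap are illustrative assumptions, not the exact settings used in these tests:

```python
# Sketch of chunk-with-overlap summarization using a Transformers pipeline.
# Model name, chunk size, and overlap are assumptions, not the settings used in the tests.
from transformers import pipeline

summarizer = pipeline("summarization", model="MBZUAI/LaMini-Flan-T5-248M")

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into word chunks that overlap, so context is kept across boundaries."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def summarize_long_text(text: str) -> str:
    # Summarize each chunk, then join the partial summaries into one text.
    partials = [
        summarizer(chunk, max_length=150, min_length=30)[0]["summary_text"]
        for chunk in chunk_text(text)
    ]
    return " ".join(partials)
```

The three LangChain styles listed above would instead be selected via `load_summarize_chain(llm, chain_type=...)` from `langchain.chains.summarize`, with `chain_type` set to "map_reduce", "stuff", or "refine".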
--------------------------------------------------------------------------------
/Optimization/write-follow-clap-comment-highloght-repeat.txt:
--------------------------------------------------------------------------------
1 | Title: Write, follow, clap, comment, highlight, repeat | by John Pearce | Medium
2 | --------------------------------------------------------------------------------
3 |
4 | author is John Pearce
5 |
6 | That seems to be the magic sauce in the Medium recipe, based on reading lots of articles on how to succeed.
7 |
8 | Write
9 |
10 | That’s what we’re here for, unless you are a “lurker” (like one of my brothers, who has been on Facebook for five years without gaining a single friend, which is a bit sad, but that is another tale). Regular posting of stories seems to stir the pot and get us noticed. We all get stuck for ideas at times, and one suggestion is to use a random word/phrase generator, or if you come across a great new word, like jeremiad or bellwether, to write an article on that. It is a fun challenge! It also seems to pay to tag different articles with lots of different topics, to pick up readers from across the spectrum of Medium, rather than just the needlecraft group of three. Mixing the ingredients makes for surprising results, which may be why I am only allowed to cook baked potatoes or pasta at home.
11 |
12 | Follow
13 |
14 | There seem to be mixed views on this. Some swear by “follow for follow” as a way to build numbers; at first I saw it as a bit like cheating, but came round to the view that it is reciprocal altruism in practice, and therefore a good thing, and great at building connections and a community among new members. Others regard “follow for follow” as a flawed strategy, as it may attract followers with little interest in your Morris Dancing niche, and instead prefer organic growth of genuine followers. It may be my writing, but I was only picking up a handful this way!
15 |
16 | Clap
17 |
18 | Just as “follow for follow” helps everyone climb the ladder, it is good to be nice, so if someone appreciates your writing, encourage them as well. To start with I didn’t realise you could clap up to 50 times, mistaking the clap symbol for being similar to the like feature in Facebook. Apologies for being a Grinch.
19 |
20 | Comment
21 |
22 | There seems to be wide agreement that engaging with other writers is good for building followers. It takes time, but it is beneficial for building the Medium community, so it is recommended.
23 |
24 | Highlight
25 |
26 | Not sure about this, as not everyone wants to see a particular phrase or paragraph of their work highlighted, but again it seems recommended, and shows appreciation of other writers’ work.
27 |
28 | Repeat
29 |
30 | Yes, back to writing. It is the purpose of Medium, and why most of us are here, except the lurkers!
31 |
32 | So to succeed with your Medium masala, the magic sauce seems to be write, follow, clap, comment, highlight, repeat.
33 |
--------------------------------------------------------------------------------
/Optimization/GPT4VsLima.txt:
--------------------------------------------------------------------------------
1 | Title: GPT-4 vs LIMA: Rethinking Large Language Models for Efficiency and Performance | by Amir Shakiba | May, 2023 | Medium
2 |
3 | GPT-4 vs LIMA: Rethinking Large Language Models for Efficiency and Performance
4 | written by Amir Shakiba
5 |
6 |
7 | A recent paper by Meta AI has the potential to revolutionize our understanding of large language models (LLMs).
8 |
9 | To delve into their workings, let’s take a closer look at Meta AI’s LLAMA model. (you can jump straight to LIMA if you want)
10 |
11 | LLAMA
12 |
13 | LLMs, which are trained on vast amounts of text, have given us impressive results. Initially, it was believed that bigger models are necessary for better performance. However, recent papers suggest that smaller models trained on more data can actually deliver better results, challenging the notion of model size. Importantly, practical considerations come into play. In terms of production efficiency, it is more advantageous to train a smaller model for a longer duration, rather than opting for a larger model trained in a shorter time frame that requires more GPU resources during inference.
14 |
15 | Smaller models trained on larger datasets mean lower cost and greater affordability, leading to the democratization of AI, which OpenAI is really concerned about!
16 |
17 | This is where LLAMA models come in. Despite having fewer parameters compared to GPT-3 models, LLAMA models can run on a single GPU. Additionally, LLAMA models are exclusively trained on openly accessible datasets, in contrast to other systems like ChatGPT, which rely on data that is not publicly available.(openAI or closeAI?:)
18 |
19 | LIMA
20 |
21 | Now let’s shift our focus to LIMA, a new LLAMA model developed by Meta AI. LLMs undergo two distinct stages of training. Firstly, they are trained on massive amounts of data to acquire general-purpose representations. Secondly, instruction tuning or reinforcement learning is employed to guide the model for specific tasks. Notably, reinforcement learning from human feedback (RLHF) has been championed by OpenAI as a crucial aspect of training models like ChatGPT. However, this new study suggests that RLHF has limited impact on training. The majority of learning occurs during pretraining and training on the massive text corpus.
22 |
23 | If humans are not even needed for their feedback, what makes them useful?
24 |
25 | LIMA, a 65B-parameter LLAMA model, stands out as it is trained on only 1,000 precise prompts. Remarkably, LIMA achieves competitive results comparable to GPT-4, Claude, or Bard. This highlights the power of pretraining and diminishes the significance of large-scale instruction tuning and reinforcement learning approaches.
26 |
27 | In summary, Meta AI’s research sheds light on the potential of LLAMA models and challenges the conventional understanding of LLMs. The focus on training smaller models on larger datasets and the limited role of reinforcement learning in training highlight the efficiency and effectiveness of this approach. LIMA exemplifies the promising capabilities of LLAMA models and their ability to achieve impressive performance with significantly fewer parameters.
28 |
29 | Our understanding of the billion-parameter world is still limited; there’s more to discover.
30 |
31 | link to the papers:
32 |
33 | LLAMA: https://arxiv.org/pdf/2302.13971.pdf
34 | LIMA : https://arxiv.org/pdf/2305.11206.pdf
--------------------------------------------------------------------------------
/Optimization/extractive-vs-abstractive-summarization-in-healthcare.txt:
--------------------------------------------------------------------------------
1 | Title: Extractive vs Abstractive Summarization in Healthcare
2 | ------------------------------------------------------------
3 | source: https://www.abstractivehealth.com/extractive-vs-abstractive-summarization-in-healthcare
4 | author is Vince Hartman
5 |
6 | Extractive vs Abstractive Summarization in Healthcare
7 | There are two approaches to summarize information: extractive summarization which copies the most relevant sentences from a text, and abstractive summarization which generates new sentences. Abstractive summarization is the most promising method for automated text summarization and has recently been possible thanks to the advancement of the NLP transformer models.
8 |
9 | Summarizing text is surprisingly hard. While there are an infinite number of ways you can distill text into the most important parts, doing it well requires you to master conciseness, coherence and comprehension. We create summaries regularly for numerous activities such as academic papers, wikipedia entries, movies, books, legal documents, business ideas, and even ourselves (the “tell me about yourself” in interviews). It’s no wonder that an automated summary has been pursued for over 70 years in the fields of statistics and computer science, and 20 years within healthcare. Recently, results for automated computerized summaries have been impressive. And for the first time, a good abstractive summary is now possible in healthcare.
10 |
11 | So what is abstractive summarization? In the field of summarization, there are two approaches: extractive and abstractive methods. Extractive summarization copies (or extracts) the most important words or phrases from a text to concatenate the content: i.e. imagine selecting the top 3 sentences in a document and presenting those as the summary. Abstractive summarization generates new sentences that never existed by synthesizing the salience of the original text: i.e. paraphrasing the central idea in your own words.
12 |
13 | An obvious problem with extractive summarization is that it lacks fluency; the sentences don’t flow naturally from one sentence to the next. It is generally jarring since there are no transitions between topics and the next sentence. Secondly and most importantly, the main idea of the text might be buried in the original source text and thus cannot be captured in one individual sentence, so comprehension might be lacking. Extractive summarization generally works well for a structured source text, like a news article, where the author presents the most important content to the reader in a key thesis sentence (that topic statement we were trained to write for the five-paragraph essay). Where extractive methods fail is for more artistic and unstructured text when the main idea is a crescendo over numerous pages. Such as when we read a great novel and come to understand the main idea as we reach the denouement. For example, extraction would work poorly for a novel like Moby Dick which opens with the iconic line “Call me Ishmael''. While a beautiful and popular line, the sentence by itself provides little context that the novel is ultimately about the destructive nature of Ahab’s obsessive quest of a gigantic sperm whale.
14 |
15 | In healthcare, extractive summarization is great for getting the high-level diagnoses, allergies, and past procedures for a patient, and then structuring all that content with a simple rule-based algorithm. A general weakness of this approach is that the summary reads like a computer wrote it, and maintaining all those rules quickly becomes difficult. The most glaring weakness is that these summaries lack context for how the patient is progressing over their treatment. For example, about 10% of the US population has type 2 diabetes; it’s a very common disease, but it can be life-threatening if not managed properly. A typical extractive summary would inform you that a patient has the ICD-10 code for diabetes, but it provides no context on whether the patient has been managing their blood sugar levels well or is at risk for hospitalization. The course of their treatment is not captured in the extractive summary, and the physician is still left to search through hundreds of notes to understand their patient. This is where abstractive summarization techniques excel.
16 |
17 | Abstractive summarization is relatively new in healthcare and has coincided with the advancement of the NLP transformer models that have taken off since 2017 (the release of BERT). Because healthcare is particularly challenging, the only commercial applications to date are for automating the radiology impression section for radiologists. The impression section summarizes the key findings of a radiology report, so a computerized version saves the radiologists’ time by not needing to manually write out that summary. The findings section of a radiology report is generally less than 500 words, so a computerized summary does not have to worry about the challenges of longform documentation (i.e. summarizing thousands of words from the whole medical record). That said, these commercial applications still need to address other challenges with a good factual summary in healthcare designed for an individual physician; so the technology is definitely impressive.
18 |
19 | In our current pilot of Abstractive Health with Weill Cornell Medical Center, we are building the first commercial abstractive summary of the full patient record in healthcare (so hundreds of notes, not just the radiology report). Our summarization structure is based on those same NLP transformer models from 2017, with some significant modifications. One of the core research claims we are demonstrating is that our automated summary of the patient chart is a close equivalent to a physician-written summary. Thus, our tool could be used as a supplement for physicians in patient admission, transfer, and discharge workflows.
20 |
--------------------------------------------------------------------------------
/LaMini paper.txt:
--------------------------------------------------------------------------------
1 | Large language models (LLMs) with instruction finetuning demonstrate superior generative capabilities. However, these models are resource-intensive. To alleviate this issue, we explore distilling knowledge from instruction-tuned LLMs to much smaller ones. To this end, we carefully develop a large set of 2.58M instructions based on both existing and newly-generated instructions. In addition to being sizeable, we design our instructions to cover a broad set of topics to ensure diversity. A thorough investigation of our instruction data demonstrates their diversity, and we generate responses for these instructions using gpt-3.5-turbo. We then exploit the instructions to tune a host of models, dubbed LaMini-LM, of varying sizes, both from the encoder-decoder as well as the decoder-only families. We evaluate our models both automatically (on 15 different NLP benchmarks) and manually. Results show that our proposed LaMini-LM models are on par with competitive baselines while being nearly 10 times smaller in size.
2 |
3 |
4 | Large language models (LLMs) with instruction
5 | tuning are capable of generating remarkable outputs for a wide range of use cases. However, these models usually have billions of parameters, which require massive computational resources for both training and inference. Kaplan et al. (2020) suggest
6 | that the performance of LLMs scales proportionally with model and dataset size. Consequently,
7 | scaling the models raises many issues such as those related to the energy footprint.
8 | Moreover, the accessibility of large models is a real concern for many NLP practitioners due to limited access to computing resources.
9 | In this work, we present LaMini-LM, a collection of language models that are notably smaller
10 | in size than most existing instruction-tuned models. We develop LaMini-LM models by employing
11 | sequence distillation (also known as offline distillation) from LLMs.
12 | Although similar attempts have been made in recent work, there are several gaps in this literature that we aim to address. Specifically, these works
13 | often provide a small-scale distilled dataset that is not necessarily diverse, and a limited number of models (typically only one), without comprehensive evaluation nor analysis of the models’ performance.
14 | Furthermore, many of the distilled models resulting from prior work tend to still be relatively computationally intensive. That is, parameters of these recent models usually range
15 | from 7B to 13B, making them difficult to deploy in resource-constrained settings, especially for under-resourced institutions. To alleviate these issues, we first generate a large-scale offline distillation dataset comprising 2.58M instructions, and then fine-tune a collection of language models to obtain the LaMini-LM models, as shown in Figure 1.
16 | We collate instructions from various prior datasets such as
17 | self-instruct, P3, FLAN and Alpaca. Additionally, we use ChatGPT (gpt-3.5-turbo) to generate supplementary instructions, with an emphasis on diversity that adheres to the existing human-written instructions in the prompt. This approach is known as Example-Guided Instruction Generation.
18 | To further increase the diversity in the generated text, we also introduce the Topic-Guided Instruction Generation method. Subsequently, we use gpt-3.5-turbo to generate responses for each instruction.
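A rough sketch of what example-guided instruction generation can look like in practice is shown below. The prompt wording, the seed instructions, and the use of the pre-1.0 openai Python client are illustrative assumptions; the paper's actual prompts and pipeline differ:

```python
# Illustrative sketch of example-guided instruction generation with gpt-3.5-turbo.
# Prompt, seed instructions, and client usage are assumptions, not the paper's setup.
import openai  # pre-1.0 openai client assumed

seed_instructions = [
    "Explain the difference between extractive and abstractive summarization.",
    "Write a short email asking a colleague to review a document.",
    "List three pros and cons of remote work.",
]

prompt = (
    "Here are some example instructions:\n"
    + "\n".join(f"- {s}" for s in seed_instructions)
    + "\nGenerate 5 new, diverse instructions in a similar style, one per line."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
new_instructions = response["choices"][0]["message"]["content"].splitlines()
print(new_instructions)
```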
19 | After generating the dataset, we fine-tune several smaller language models with varying sizes
20 | (from 61M to 1.5B) and architectures (encoderdecoder and decoder-only). Furthermore, we compare different variations of models with the same architecture. Our work is also distinguished from previous research by providing a comprehensive evaluation of the resulting models. We assess the performance of the models on various NLP downstream tasks, in addition to manual human evaluation of the model’s outputs. This analysis offers a more in-depth understanding of the models’ strengths and weaknesses.
21 | Our contributions can be summarized as follows:
22 | 1. We release a large-scale instruction dataset that contains over 2.58M examples. To the
23 | best of our knowledge, this dataset is the largest instruction dataset currently available
24 | in the NLP literature. Our instruction dataset is ×50 larger than the one released by Taori
25 | et al. (2023).
26 | 2. We explore the process of distilling knowledge from LLMs to various much smaller
27 | model architectures, resulting in a family of distilled language models. Our largest model
28 | and smallest model are ×110 and ×2800 smaller than GPT-3 (Brown et al., 2020), respectively.
29 | 3. We conduct extensive experiments on both our proposed models and several publicly available LLMs. These experiments include automatic evaluation on 15 NLP tasks and human evaluation. Both our automatic and human evaluations show that our proposed models achieve comparable performance with Alpaca while being nearly ×10 smaller in size.
30 | Knowledge Distillation
31 | Knowledge distillation is a process used to train a smaller model, referred to as the student, by learning from a larger model, known as the teacher. One of the most commonly used methods of knowledge distillation involves training the student with an additional objective of matching the teacher’s representation, such as logits, output probability, or intermediate activation. For sequence-to-sequence or generative models, Kim and Rush (2016) introduced the concept of sequence-level distillation. This approach involves generating a synthetic output by running inference with the teacher model, which is then used to train the student model. This method is more efficient as it only requires running the usually large teacher model once. Previous research has demonstrated the effectiveness of sequence-level distillation. For
32 | instance, Costa-jussà et al. (2022) used sequence-level distillation to reduce the size of an NLLB machine translation system to 600M parameters. Similarly, by combining sequence-level distillation with model pruning and quantization, Bogoychev et al. (2020) managed to train a translation system that was approximately ×25 smaller than the teacher model without a significant decrease in BLEU score. In our work, we train our model on the output of gpt-3.5-turbo, which can be viewed as a sequence-level distillation approach. While other researchers also train language models based on the output of GPT models, our work is distinct in
33 | that we train our model on a considerably larger dataset and distill it into much smaller models. Furthermore, we provide various student models.
34 |
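As a rough sketch, sequence-level (offline) distillation amounts to ordinary supervised fine-tuning of a small student on the teacher's outputs. The student checkpoint, hyperparameters, and the toy data below are illustrative assumptions, not the paper's setup:

```python
# Sketch of sequence-level distillation: fine-tune a small encoder-decoder student
# on (instruction, teacher_response) pairs. All names and settings are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

student_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForSeq2SeqLM.from_pretrained(student_name)

# In the paper the responses come from gpt-3.5-turbo; here they are hard-coded toy examples.
pairs = Dataset.from_dict({
    "instruction": ["Summarize: The cat sat on the mat all day and refused to move."],
    "response": ["A cat spent the whole day sitting on the mat."],
})

def tokenize(batch):
    model_inputs = tokenizer(batch["instruction"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["response"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = pairs.map(tokenize, batched=True, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=student,
    args=Seq2SeqTrainingArguments(output_dir="lamini-student",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=student),
)
trainer.train()
```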
35 | Encoder-Decoder vs. Decoder-Only
36 | The encoder-decoder LaMini language models (LaMiniT5 series and LaMini-Flan-T5 series) outperform the decoder-only LaMini language models (LaMini-GPT series) when the number of
37 | parameters is limited (less than 500M parameters). LaMini-Flan-T5-248M even outperforms
38 | LLaMa-7B on downstream NLP tasks. When the model size is higher, LaMini-Flan-T5 is comparable to LaMini-GPT. Yet, both LaMini-Flan-T5 and LaMini-T5 demonstrate strong human evaluation
39 | results for user-oriented instructions, despite their relatively small size.
40 | In particular, T5-based models with around 200M parameters are competitive with LaMini-GPT-1.5B in the human evaluation. We recommend further exploration of the encoder-decoder architecture for language models, given its potential, as demonstrated in our experiments.
41 | Conclusion
42 | In this work, we release a large-scale instruction dataset distilled from ChatGPT with more than 2.58M examples. To the best of our knowledge, this dataset is currently the largest dataset of its kind.
43 | We explore distilling knowledge from LLMs to various smaller and more efficient model architectures. We refer to the resulting family of language models as LaMini, which includes 6 encoder-decoder models and 9 decoder-only models with varying model sizes. We also conduct a comprehensive evaluation in this work, including the automatic evaluation of the downstream NLP tasks and human evaluation.
44 | Both evaluation strategies highlight that our proposed models achieve comparable performance
45 | with Alpaca while being nearly ×10 smaller in size. This work sheds light on distilling knowledge from LLMs to much smaller model architectures and demonstrates the potential
46 | of training efficient yet effective language models.
47 |
48 |
--------------------------------------------------------------------------------
/Optimization/nlp-basics-abstractive-and-extractive-text-summarization.txt:
--------------------------------------------------------------------------------
1 | Title: NLP Basics: Abstractive and Extractive Text Summarization
2 | ----------------------------------------------------------------
3 | source: https://www.scrapehero.com/nlp-basics-abstractive-and-extractive-text-summarization/
4 |
5 | Summarization is one of the most common tasks that we perform in Natural Language Processing (NLP). With the amount of new content generated every day by billions of people and their smartphones, we are inundated with an increasing amount of data. Humans can only consume a finite amount of information and need a way to separate the wheat from the chaff and find the information that matters. Text summarization can help achieve that for textual information. We can separate the signal from the noise and take meaningful action on it.
6 |
7 | In this article, we explore different methods to implement this task and some of the learnings that we have come across on the way. We hope this will be helpful to other folks who would like to implement basic summarization in their data science pipeline for solving different business problems.
8 |
9 | Python provides some excellent libraries and modules to perform Text Summarization. We will provide a simple example of generating Extractive Summarization using the Gensim and HuggingFace modules in this article. We will explore other models and modules in upcoming articles in this series.
10 |
11 | When to use Summarization?
12 |
13 |
14 | It may be tempting to use summarization for all texts to get useful information from them and spend less time reading. However, for now, NLP summarization has been a successful use case in only a few areas.
15 |
16 | Text summarization works great if a text has a lot of raw facts and can be used to filter important information from them. The NLP models can summarize long documents and represent them in small simpler sentences. News, factsheets, and mailers fall under these categories.
17 |
18 | However, for texts where each sentence builds upon the previous one, text summarization does not work that well. Research journals and medical texts are good examples of texts where summarization might not be very successful.
19 |
20 | Finally, if we take the case of summarizing fiction, summarization methods can work fine. However, it might miss the style and the tone of the text that the author tried to express.
21 |
22 | Hence, Text summarization is helpful only in a handful of use cases.
23 |
24 | Two Types Of Summarization
25 |
26 | There are two main types of Text Summarizations
27 |
28 | Extractive
29 |
30 | Extractive summarization methods work just as the name suggests. They take the text, rank all the sentences according to their relevance and importance within the text, and present you with the most important sentences.
31 |
32 | This method does not create new words or phrases, it just takes the already existing words and phrases and presents only that. You can imagine this as taking a page of text and marking the most important sentences using a highlighter.
33 |
34 | Abstractive
35 |
36 | Abstractive summarization, on the other hand, tries to guess the meaning of the whole text and presents the meaning to you.
37 |
38 | It creates words and phrases, puts them together in a meaningful way, and along with that, adds the most important facts found in the text. This way, abstractive summarization techniques are more complex than extractive summarization techniques and are also computationally more expensive.
39 |
40 | Comparison of both summarization types
41 |
42 | The best way to illustrate these types is through an example. Here we have run the Input Text below through both types of summarization and the results are shown below.
43 |
44 | Input Text:
45 |
46 | China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.
47 |
48 | Extractive Summarization Output:
49 |
50 | While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones.
51 |
52 | Abstractive Summarization Output:
53 |
54 | Huawei overtakes Samsung as world’s biggest seller of mobile phones in the second quarter of 2020. Sales of Huawei’s 55.8 million devices compared to 53.7 million for south Korea’s Samsung. Shipments overseas fell 27 per cent in Q2 from a year earlier, but company increased its dominance of the china market. Position as number one seller may prove short-lived once other markets recover, a senior Huawei employee says.
55 |
56 | Extractive Text Summarization Using Gensim
57 |
58 | Import the required libraries and functions:
59 |
60 | from gensim.summarization.summarizer import summarize
61 |
62 | from gensim.summarization.textcleaner import split_sentences
63 |
64 | We store the article content in a variable called Input (mentioned above). Next, we have to pass it to the summarize function, the second parameter being the ratio we want the summarized text to be. We chose 0.4, so the summary will be around 40% of the original text.
65 |
66 | summarize(Input, 0.4)
67 |
68 | Output:
69 |
70 | While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones.
71 |
72 | With the parameter split=True, you can see the output as a list of sentences.
73 |
74 | Gensim summarization works with the TextRank algorithm. As the name suggests, it ranks texts and gives you the most important ones back.
75 |
76 | Extractive Text Summarization Using Huggingface Transformers
77 |
78 | We use the same article to summarize as before, but this time we use a transformer model from Huggingface:
79 |
80 | from transformers import pipeline
81 |
82 | We have to load the pre-trained summarization model into the pipeline:
83 |
84 | summarizer = pipeline("summarization")
85 |
86 | Next, to use this model, we pass the text, the minimum length, and the maximum length parameters. We get the following output:
87 |
88 | summarizer(Input, min_length=30, max_length=300)
89 |
90 | Output:
91 |
92 | China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million. Samsung posted a bigger drop of 30 per cent, owing to disruption from coronavirus in key markets such as Brazil, the United States and Europe.
93 |
94 | Where can you get the data from?
95 |
96 | You can scrape news websites to get the data to try these summarization techniques. If you aren’t keen on building scrapers to collect this data, you can try our News API for FREE.
97 |
98 | Conclusion
99 |
100 | We saw some quick examples of Extractive summarization, one using Gensim’s TextRank algorithm, and another using Huggingface’s pre-trained transformer model. In the next article in this series, we will go over LSTM, BERT, and Google’s T5 transformer models in-depth and look at how they work to do tasks such as abstractive summarization.
101 |
102 | Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
--------------------------------------------------------------------------------
/Optimization/BERTexplanation.txt:
--------------------------------------------------------------------------------
1 | Title: BERT: A Beginner-Friendly Explanation | by Digitate | May, 2023 | Medium
2 | -------------------------------------------------------------------------------
3 | written by Pushpam Punjabi
4 |
5 |
6 | Up until now, we’ve seen how a computer understands the meaning of different words using word embeddings. In the last blog, we also looked at how we can take the average of the embeddings of the words appearing in a sentence to represent that sentence as an embedding. This is one way of interpreting a sentence. But that’s not how humans understand language. We don’t just take the individual meanings of words and form an understanding of a sentence or a paragraph. A much more complex process is involved when humans understand language. But how does a machine understand language? It’s through language models!
7 |
8 | Language models are an essential component of Natural Language Processing (NLP), designed to understand and generate human language. They use various statistical and machine learning techniques to analyze and learn from large amounts of text data, enabling them to identify patterns and relationships between words, phrases, and sentences. Word embeddings form the base in understanding these sentences! Language models have revolutionized the field of NLP and have played a crucial role in enabling machines to interact with humans in a more natural and intuitive way. Language models have also surpassed humans in some of the tasks in NLP!
9 |
10 | In this blog, we will understand Bi-directional Encoder Representations from Transformers (BERT), which is one of the biggest milestones in the world of language models!
11 |
12 | Understanding BERT
13 |
14 | BERT was developed by Google in 2018. It is a “Language Understanding” model that is trained on massive amounts of text data to understand the context and meaning of words and phrases in a sentence. BERT uses the “transformer” deep learning architecture, which enables it to process information bidirectionally, meaning it can understand the context of a word based on both the words that come before and after it. This allows BERT to better understand the nuances of language, including idioms, sarcasm, and complex sentence structures.
15 |
16 | You must be wondering how such models are trained to understand human language. There are 2 training steps involved in using BERT:
17 |
18 | Pre-training phase
19 | Fine-tuning phase
20 | 1. Pre-training phase
21 |
22 | In the pre-training phase, the model is trained on a huge amount of textual data. This is the stage where the model learns and understands the language. Pre-training is expensive. To pre-train a BERT model, Google used multiple TPUs — special computing processors for deep learning models. It took them 4 days to pre-train BERT on such a large infrastructure. But this is only a one-time procedure. Once the model understands the language, we can reuse it for a variety of tasks in NLP. There are 3 steps to pre-train BERT:
23 |
24 | Text corpus selection
25 | Masked Language Modeling
26 | Next Sentence Prediction
27 |
28 | Let’s go through each step in detail.
29 |
30 | 1.1 Text Corpus Selection
31 |
32 | Before I talk about data, we must understand that these models are huge in size. Not only the size on disk, but also the number of mathematical parameters we need to calculate inside these deep learning models. To give you some perspective, the largest BERT model is about 1.4 GB on disk if it is saved as a binary file!
33 |
34 | For the text corpus selection, you need to have some considerations around the text you want to use:
35 |
36 | · Size of the corpus
37 |
38 | · Domain of the text
39 |
40 | · Language of the text
41 |
42 | For BERT, we stick to the English language. BERT is trained on a combination of 2 datasets: the whole English Wikipedia dump and BookCorpus, a collection of free ebooks. These are general datasets that do not focus on any specific domain. If the raw text of these datasets were stored in a .txt file, the size would be in GBs!
43 |
44 | To train any deep learning model, we need annotated data. The dataset which we have mentioned is just raw text. To annotate such a huge text data for any task, a lot of manpower would be required. The researchers have designed a self-supervised way to create 2 tasks and train the transformer model on those tasks.
45 |
46 | 1.2 Masked Language Modeling
47 |
48 | BERT is first trained as a Masked Language Model (MLM) to understand a sentence in both directions of context — left to right and right to left. Essentially, BERT is given an input sequence, where 15% of the words are masked. The task for BERT is to predict these masked words, by reading both, the left-side, and the right-side context of the masked word.
49 |
50 | In this example, 2 words are masked — store and gallon. BERT must predict both the words correctly. These 15% of the words are randomly selected. Thus, in a self-supervised manner, all the raw text is now annotated for the task of predicting masked words.
51 |
52 | One of the benefits of MLM is that it enables BERT to understand language in a more natural and nuanced way. By predicting the missing words in a sentence, BERT can better understand the context and meaning of the words that are present. This can be especially useful for applications such as sentiment analysis, where understanding the meaning and tone of a sentence is crucial for accurately interpreting its sentiment.
53 |
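A quick, hands-on way to see masked language modeling in action is the Hugging Face fill-mask pipeline. This is only an illustrative sketch: the sentence is made up (echoing the "store"/"gallon" example above), and bert-base-uncased is the standard public checkpoint, not necessarily the exact model discussed here.

```python
# Illustrative sketch of BERT's masked-word prediction using the fill-mask pipeline.
# The sentence is a made-up example.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I went to the [MASK] to buy a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
```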
54 | 1.3 Next Sentence Prediction
55 |
56 | Masked Language Modeling helps BERT in understanding the relationship between words. But what about relationship between various sentences in a paragraph? The task — Next Sentence Prediction helps BERT in understanding relationship between the sentences. This is a simple task which can be generated in a self-supervised way, from any text corpus. The task: Given two sentences A and B, is B the actual sentence that comes after A, or just a random sentence from the text data?
57 |
58 | Next sentence prediction is a useful technique for a variety of NLP tasks. By understanding the relationships between sentences, BERT can better understand the overall meaning and context of a passage of text. This can be especially important for applications such as chatbots or virtual assistants, where the ability to understand and interpret human language is crucial for providing accurate and helpful responses.
59 |
60 | 2. Fine-tuning phase
61 |
62 | After we have pre-trained the BERT model, we can fine-tune it for any task in NLP. We can now use a domain-specific dataset in the same language to take advantage of the model’s understanding of that language. We don’t require a large dataset for fine-tuning a BERT model. Thus, this process is inexpensive — a few hours on a single GPU would suffice for fine-tuning the model.
63 |
64 | The goal of fine-tuning is to further optimize the BERT model to perform well on a specific task, by adjusting its parameters to better fit the data for that task. For example, a BERT model that has been pre-trained on a large corpus of text data can be fine-tuned on a smaller dataset of movie reviews to improve its ability to accurately predict the sentiment of a given review.
65 |
66 | Fine-tuning a BERT model is a powerful tool for a variety of NLP applications, as it enables the model to be tailored to specific tasks and datasets. By fine-tuning a BERT model, researchers and developers can achieve higher levels of accuracy and performance on specific tasks, which can ultimately lead to more effective and useful natural language processing applications.
67 |
68 | Domain specific pre-training
69 |
70 | We used a generic English language text dataset to pre-train the BERT model. This gives us an edge as the model understands the language, but it doesn’t understand the domain. E.g., if we want to use a language model in medical domain, then it must understand meaning and context of the medical terms, procedures, etc.
71 |
72 | For this, we can pre-train the model on a very specific domain, like medicine, in the same language. This increases the accuracy further when we fine-tune the model for a specific task in the same domain. One example of such a language model is BioBERT, which is pre-trained on a huge biomedical text corpus. It has shown increased accuracy over generic BERT on tasks involving the biomedical domain. Similarly, we can pre-train the BERT model on text of any domain required by the business use case.
73 |
74 | Advantages
75 |
76 | · BERT is a highly effective natural language processing model that has achieved state-of-the-art results on a wide range of tasks.
77 |
78 | · BERT uses a unique “transformer” architecture that enables it to better understand the context and meaning of words and phrases in a sentence.
79 |
80 | · BERT can be fine-tuned on specific tasks and datasets, which allows it to be tailored to specific applications and achieve even higher levels of accuracy.
81 |
82 | · BERT is open source and widely available, making it accessible to researchers and developers around the world.
83 |
84 | Limitations
85 |
86 | · BERT requires significant computational resources to pre-train and relatively significant resources to fine-tune, which can be a barrier to entry for smaller research groups or individuals.
87 |
88 | · BERT is trained on large amounts of text data, which can make it difficult to apply to domains or languages with limited data available.
89 |
90 | · BERT can sometimes struggle with understanding context that is not explicitly stated in the text, such as background knowledge or cultural references.
91 |
92 | · BERT is a language model, and as such, it may struggle with tasks that require more than just language understanding, such as tasks that involve visual or audio information.
93 |
94 | Applications
95 |
96 | One of the applications of BERT is extractive question answering. BERT can be fine-tuned on a dataset of question-answer pairs, to enable it to accurately answer questions posed in natural language. Along with these pairs, a passage is provided as a reference, from which the answer is extracted for the given question.
97 |
98 | A BERT model fine-tuned on a question answering dataset could be used to answer such factual questions, providing the correct answer based on the context of the question. This has many potential real-world applications, such as customer service chatbots or virtual assistants that can provide users with accurate and helpful responses to their questions.
99 |
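A minimal sketch of extractive question answering with a BERT-style model is shown below. The distilled checkpoint and the passage are illustrative assumptions:

```python
# Sketch of extractive QA: the model selects the answer span from the given passage.
# The checkpoint (a distilled BERT fine-tuned on SQuAD) and the passage are assumptions.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = ("BERT was developed by Google in 2018 and is pre-trained with masked "
           "language modeling and next sentence prediction.")
answer = qa(question="Who developed BERT?", context=context)
print(answer["answer"], answer["score"])
```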
100 | ignio leverages pre-trained transformers for usecases of different domains. Example usecases include: IT security domain to automatically fill up security surveys, legal domain to analyze contracts and NDAs to automatically flag acceptable and unacceptable clauses, extracting information from data sources to capture different aspects of enterprise context, and mapping trouble tickets to ignio’s automation catalog to identify tickets that can be auto-resolved by ignio.
101 |
102 |
103 | About Pushpam Punjabi, the author
104 |
105 | Pushpam Punjabi is a Machine Learning Engineer who develops solutions for the use cases emerging in the field of Natural Language Processing (NLP)/Natural Language Understanding (NLU). He enjoys learning the inner workings of any algorithm and how to implement it effectively to solve any of the posed problems.
106 |
107 |
--------------------------------------------------------------------------------
/Optimization/WhyOneMinuteManagers.txt:
--------------------------------------------------------------------------------
1 | Title: Why We Still Need One-Minute Managers | by Fabio Matricardi | May, 2023 | Better Programming
2 |
3 | How to reshape modern management, empower humanity, and unlock the hidden potential of the one-minute manager
4 | author is Fabio Matricardi
5 |
6 | Prologue
7 |
8 | In this article, we will look at modern management without forgetting the lesson learned by the famous bestseller “The One Minute Manager.”
9 |
10 | “One-Minute Manager”, the book
11 | The principles of the good manager
12 | To empower humans, you need to be an empowered human
13 | A checklist for the leaders we need now
14 | A new training workpack: key values for effective leadership
15 |
16 | Our working environment may be tough or soft. Our tasks and role may be easy or complex, boring or challenging. In any of these scenarios, one thing is in common: we have to deal with our line managers, and they have to deal with their manager… who also reports to the department head and so on, straight to the top in a never-ending chain.
17 |
18 | AI has not yet replaced management. In my working experience, this is still a neutral statement. Technology has advanced to the point where AI can help us manage our work but cannot replace human leadership. And this can be true only if a manager is evaluated for their humanity, for the way they live their values and apply them when dealing with the people they are leading.
19 |
20 | How can we make sure that such a manager is educated and trained? Let’s look at the book “The One Minute Manager.” It is a short book by Ken Blanchard and Spencer Johnson that, after its first publication, became a New York Times bestseller, sold 15 million copies, and has been translated into more than 40 languages.
21 |
22 | Photo by Andreas Klassen on Unsplash
23 | One-Minute Manager, the Book
24 |
25 | Ken Blanchard and Spencer Johnson introduced the concept of the One-Minute Manager in their book “The One Minute Manager” in 1982. The book became a bestseller and has remained popular ever since. The “One-Minute Manager” is a short and simple management style that focuses on three core principles: setting goals, praising progress, and redirecting behavior.
26 |
27 | The book tells the story of a bright young man looking for an effective manager. This young man searched for an effective manager for many years, traveling to small towns and capital cities. He spoke to multiple types of managers: government officials, military officers, corporation executives, shop foremen, and foundation directors, among others.
28 |
29 | He visited offices of all sizes and layouts and saw various management styles. However, he didn’t always agree with what he observed. He witnessed “tough” managers who achieved success while their people did not, though their superiors thought they were good managers.
30 |
31 | Sitting in the offices of the “tough people,” the man asked them, “What kind of manager are you?” Answers varied slightly: “Autocratic,” “Bottomline,” “Hard-nosed,” “Realistic,” and “Profit-minded.” He heard pride in their voices and a focus on results.
32 |
33 | He also met “nice” managers whose people were successful, but some of whom had their doubts. Responding to the same question, they said they were “Democratic,” “Participative,” “Supportive,” “Considerate,” and “Humanistic,” still with pride and a focus on people.
34 |
35 | It disturbed him that managers seemed to prioritize either results or people. He thought autocratic and democratic managers were only partly effective, like being “half a manager.”
36 |
37 | The young man had searched far and wide for an effective manager but was still waiting. His only advantage was knowing what to look for: one who managed the organization and people in a way that benefited both. He heard stories of a special manager in a nearby town who had great results with those he worked with and was curious. He contacted the manager’s secretary and was put through immediately. The manager agreed to see him anytime that week except Wednesday morning. Not expecting much, the young man went to see him.
38 |
39 | Photo by Vlad Hilitanu on Unsplash
40 | The Principles of the Good Manager
41 |
42 | I think everyone can easily identify people they know from the brief introduction: we all have met “tough” managers obsessed with profits and results; they’re often autocratic and bossy. We also have met (or we still work with) managers who are supportive and focused on their people. We know that none of them can be great leaders because they are only partially effective.
43 |
44 | While I suggest you read the book (it is very slim, easy to read, and has plenty of examples), I will describe the focal points. After that, I’ll provide some coordinates to navigate the book better.
45 |
46 | Overall, “The One-Minute Manager” provides practical advice on becoming an effective manager by focusing on people, setting clear goals, providing feedback, and continuously improving performance. In a few points, here are the best takeaways to becoming a better leader:
47 |
48 | Set One-Minute Goals: The book emphasizes the importance of setting clear, concise, and specific goals that can be accomplished in one minute. This approach helps to focus attention on what is important and provides a sense of achievement when goals are met.
49 | Give One-Minute Praisings: The authors advocate for giving immediate and sincere praise for good performance. They suggest that praise should be specific, timely, and focused on the behavior rather than the person.
50 | Provide One-Minute Reprimands: The book also discusses providing feedback when performance falls short of expectations. The authors suggest that reprimands should be immediate, specific, and focused on the behavior rather than the person.
51 | Empower your team: The book highlights the importance of empowering team members by delegating responsibilities and providing them with the resources they need to succeed. This approach helps to build confidence, motivation, and a sense of ownership.
52 | Build relationships: The authors emphasize the importance of building positive relationships with team members based on trust, respect, and open communication. This approach helps to foster a collaborative and supportive work environment.
53 | Continuously improve: The book encourages continuous improvement by setting higher goals and challenging oneself and the team to improve performance. This approach helps to drive innovation, growth, and development.
54 | To Empower Humans, You Need To Be an Empowered Human
55 |
56 | What surprises me is the waterfall effect in promotions and assignments to positions. Imagine what organizational plan an autocratic manager has: they will try to forge the system in their own image and values (profit over people, praised by their superiors). At the same time, a democratic manager will mold the organization hoping that every person is considered (people over profit, praised by the employees).
57 |
58 | Photo by Matt Collamer on Unsplash
59 |
60 | So something needs to be fixed not only in the selection of managers but also in their training. If the only condition is an MBA or previous accomplishments in the company, the same mistakes will be repeated. But honestly, who can evaluate the values, ethics, and soft skills of another human?
61 |
62 | You already need to be a better human, an empowered one, with communication skills not based only on procedures and matrices, if you want to be able to evaluate future managers. It is the honesty and integrity you have as a person, not the techniques you use, that are essential for building trust and credibility with others.
63 |
64 | So how do we make sure managers are focused on the organization and people in a way that benefits both?
65 |
66 | Despite the passage of time, the principles of the “One-Minute Manager” are still relevant and useful in modern management, and they can help us solve the riddle (how to manage the organization’s and the people's needs).
67 |
68 | Here are a few reasons why we still need one-minute managers, those empowered humans who can manage both organization and people needs:
69 |
70 | The workforce is more diverse than ever, with different expectations and needs. One-minute managers can adapt to these differences and provide tailored management solutions.
71 | The pace of change is faster than ever before. One-minute managers can quickly adapt to changing circumstances and provide effective solutions.
72 | The competition is fiercer than ever before. One-minute managers can help organizations stay competitive by empowering their teams and continuously improving performance.
73 | The need for ethical and responsible management has never been greater. One-minute managers can lead by example and promote a culture of integrity and respect.
74 |
75 | Overall, the “One-Minute Manager” principles provide a roadmap to becoming an effective manager who can balance the organization's and people's needs.
76 |
77 | A Checklist for the Leaders We Need Now
78 |
79 | With the help of the book, let’s go through some key points; we know from experience that this is what a good leader should do:
80 |
81 | Time efficiency: The one-minute manager’s approach allows managers to manage their employees quickly and efficiently without sacrificing results.
82 | Employee motivation: The One Minute Manager’s focus on praising progress can be a powerful motivator for employees. By giving employees regular feedback and recognition for their efforts, managers can help to boost their morale and job satisfaction.
83 | Clear Communication: The one-minute manager’s approach emphasizes clear and concise communication. By setting clear goals and expectations, and providing timely feedback, managers can help to ensure that everyone is on the same page and working towards the same objectives.
84 | Flexibility: The one-minute manager’s approach is adaptable to different management styles and situations. Managers can use the principles of the one-minute manager to manage employees in different departments, at different levels of experience, and with different personalities.
85 |
86 | In summary, the principles of the one-minute manager are still relevant today because they offer a time-efficient, motivating, clear, and flexible approach to management that can be adapted to different situations and styles.
87 |
88 | Photo by Austin Chan on Unsplash
89 | A New Training WorkPack: Key Values for Effective Leadership
90 |
91 | If an MBA is not enough, what do we need to focus on to train effective leaders for today (or at least for tomorrow)? We need a new curriculum that includes psychological studies and self-improvement, and we also need new teachers, people who can evaluate and guide them.
92 |
93 | I believe that these are the values and qualities required:
94 |
95 | Integrity: A leader must be honest, trustworthy, and transparent in their dealings with people.
96 | Empathy: A leader must understand and appreciate their employees' needs and respond with compassion and understanding.
97 | Vision: A leader must articulate a clear and compelling vision for the future and inspire people to work towards its achievement.
98 | Accountability: A leader must be accountable for their actions and decisions and willing to take responsibility for the outcomes.
99 | Collaboration: A leader must be able to work collaboratively with others and build strong relationships based on trust and respect.
100 | Innovation: A leader must be willing to think creatively and find new ways of working, because in today’s fast-paced business environment managers don’t have much time to manage their employees.
101 |
--------------------------------------------------------------------------------
/GPT Agents and AGI.txt:
--------------------------------------------------------------------------------
1 | GPT Agents and AGI
2 | ⚠️ The Path of Auto GPTs leads to some added complications.
3 | MICHAEL SPENCER AND ZVI MOWSHOWITZ
4 | source: https://substack.com/app-link/post?publication_id=396235&post_id=117160442&utm_source=post-email-title&isFreemail=true&token=eyJ1c2VyX2lkIjozMDU3MTI2MCwicG9zdF9pZCI6MTE3MTYwNDQyLCJpYXQiOjE2ODM3OTkyMTYsImV4cCI6MTY4NjM5MTIxNiwiaXNzIjoicHViLTM5NjIzNSIsInN1YiI6InBvc3QtcmVhY3Rpb24ifQ.izKWsJ-x5JAcJa2oGRNBIm4IioUl97cSRIs5oS7mnrs
5 |
6 |
7 | Today one of my favorite LessWrong contributors, Zvi, writes about AGI and GPT Agents. Luckily for us, he also writes on Substack.
8 |
9 | Zvi is a highly original thinker and analyst around A.I., including A.I. risk. LessWrong is a community blog and forum focused on discussion of cognitive biases, philosophy, psychology, economics, rationality, and artificial intelligence, among other topics. Zvi has over 550 posts there, many long and detailed. You can subscribe to his posts on the top right here.
10 |
11 | 🚨 LessWrong for me is an important sounding board on contemporary A.I. risk issues. If that topic is of interest, read this post to the end.
12 |
13 |
14 | Could future large language models (LLMs) become the key component of a future artificial general intelligence (AGI) that is smarter and more generally capable than humans?
15 | Could such an AGI then pose an existential threat to humanity?
16 | For current LLMs like GPT-4 (and ChatGPT) the answer is a definitive no.
17 | For a next-generation model, a GPT-5 worthy of the name, what about then?
18 | Probably not. Yet maybe.
19 |
20 | We cannot take much comfort in an LLM’s inability to plan, have goals or be an agent, because people are already building scaffolding to transform LLMs into agents that plan in order to achieve particular goals.
21 |
22 | We call such programs AutoGPTs.
23 |
24 | We also cannot presume that the goals chosen will be wise. Already we have ‘ChaosGPT’ whose ultimate goal is explicitly to destroy humanity. The most common goal people give such systems? “Make paperclips.”
25 |
26 | What Is AutoGPT?
27 | AutoGPT takes an LLM and turns it into an agent.
28 | The initial program was created by game designer Toran Bruce Richards.
29 | The concept works like this:
30 | This system uses GPT-4 in three distinct places.
31 | GPT-4 attempts to execute individual tasks and subtasks directly.
32 | GPT-4 generates new subtasks for tasks it cannot execute directly.
33 | GPT-4 prioritizes among its tasks and subtasks.
34 |
35 | The AutoGPT program tracks what tasks are in the queue, and has memory that provides context for task execution, creation and (presumably also) prioritization. When tasks or subtasks fail, AutoGPT evaluates the situation and attempts to course correct.
36 |
37 | Plug-ins including internet browsing are used to execute tasks. Often AutoGPT is given disk read/write access.
38 |
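To make that loop concrete, here is a rough, hypothetical Python sketch of how such a control loop can be wired together. This is my own simplification of the description above, not AutoGPT's actual source code; llm() stands in for any GPT-4 API call.

def llm(prompt: str) -> str:
    """Stand-in for a GPT-4 call (e.g., via the OpenAI API)."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 50) -> list:
    tasks = [goal]   # task queue
    memory = []      # context carried between steps
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.pop(0)
        # 1. Try to execute the task directly
        result = llm(f"Goal: {goal}\nMemory: {memory}\nExecute this task: {task}")
        if "CANNOT_EXECUTE" in result:
            # 2. Generate new subtasks for tasks it cannot execute directly
            subtasks = llm(f"Break '{task}' into smaller subtasks, one per line.").splitlines()
            tasks.extend(s for s in subtasks if s.strip())
        else:
            memory.append((task, result))
        # 3. Prioritize among the remaining tasks and subtasks
        if tasks:
            tasks = llm("Reorder these tasks by priority, one per line:\n" + "\n".join(tasks)).splitlines()
    return memory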
39 | AutoGPT quickly became #1 on GitHub. Lots of people are super excited. Many are building tools for it. There is a bitcoin wallet interaction available if you never liked your bitcoins. AI agents offer very obvious promise, both in terms of mundane utility via being able to create and execute multi-step plans to do your market research and anything else you might want, and in terms of potentially being a path to AGI and getting us all killed.
40 |
41 | Did We See AutoGPTs Coming?
42 | As with all such new developments, we have people saying it was inevitable and they knew it would happen all along, and others that are surprised. We have people excited by future possibilities, others not impressed because the current versions haven’t done much. Some see the potential, others the potential for big trouble, others both.
43 |
44 | Also as per standard procedure, we should expect rapid improvements over time, both in terms of usability and underlying capabilities. There are any number of obvious low-hanging-fruit improvements available.
45 |
46 | An example is that many people have noted ‘you have to keep an eye on it to ensure it is not caught in a loop.’ That’s easy enough to fix. Similarly, it seems easy to get rapid improvement on many tasks using special logic. There is great potential in using specialized fine tuned models for appropriate subtasks, or for different steps of the process.
47 |
48 | A common complaint is lack of focus and tendency to end up distracted. Again, the obvious things have not yet been tried to mitigate this. We don’t know how effective they will be. No doubt they will at least help somewhat.
49 |
50 | It is still very early. For now, AutoGPTs are not ready for prime time. We got a quick flurry of ‘look at what AutoGPT can do.’ What we did not get was any ability to keep them on track, or to get them to execute complex tasks in creative ways.
51 |
52 | Evolved forms of AutoGPT hold great promise. For now, they do not offer much mundane utility.
53 |
54 | What Will Future Versions Do?
55 | Future versions, even based on GPT-4, will have better memories, better focus, better self-reflection, better plug-ins, better prioritization algorithms, better ways for humans to steer while things are progressing, better monitoring of sub-tasks, better UIs (e.g. there is already a browser-based AgentGPT), better configurability, better prompt engineering and so on.
56 |
57 | This will happen even with zero fundamental innovations or other AI advances, and without any insights into AutoGPT design.
58 |
59 | This process will rapidly shed light on what things are actually hard for the underlying LLM to handle, and which things only require the right scaffolding. It seems reasonable to expect a number of large step-jumps in capability for such systems as various parts improve.
60 |
61 | AutoGPT needs to break its tasks down into subtasks that GPT-4 can complete. New possibilities open up as we gain the ability to do larger subtasks reliably, or to figure out how to define subtasks so that completing them accomplishes a larger task, and the larger task itself effectively becomes a completable subtask. If you know what you want a particular AutoGPT to do, you can create special checks and logic to expand upon this.
62 | Woe to those who don’t extrapolate here.
63 | Crossing the Threshold
64 | What might happen in the future if we had superior agent-generating architecture, and hooked it up to something worthy of the name GPT-5, or 6 or 7?
65 |
66 | Perhaps remarkably little, if such systems continue to turn out to lack key elements. If they don’t, perhaps the end of everything.
67 |
68 | A language model, in order to best predict text, needs to understand the systems of the world. A sufficiently advanced LLM will, upon request, be able to plan well, and to solve many subtasks.
69 |
70 | At some point, your agent becomes sufficiently capable and robust to accomplish complex multi-step goals. Various humans will then point it at various goals. Some of these goals will be maximalist goals, like ‘make the most money’ or ‘world peace.’ Some will be actively destructive.
71 |
72 | Such an agent, if sufficiently capable already, and which may or may not already be a full AGI, will then generate a subtask to expand its capabilities - to seek some combination of intelligence, skills, power, compute, money, information and so on. If it is capable of doing that, it will then use that new leverage to seek out yet more, and repeat in a cycle of recursive self-improvement and self-empowerment. While doing so, it will also generate subtasks like ‘prevent anyone from turning the system off or modifying the goal’ because those events would stop it from achieving its goal.
73 |
74 | That’s instrumental convergence. What helps you do almost anything? Power.
75 |
76 | Where does this end, once it starts? It might not stop at all, until this progressively smarter and more powerful monster we created takes over everything.
77 |
78 | Such systems could easily go very quickly from ‘not good enough to snowball’ to existential threats once they get started. By then, it would be too late.
79 |
80 | The Agent Overhang
81 | If agents are dangerous, should we avoid and discourage agents and agent scaffolding? Or should we encourage development of better agent scaffolding, to avoid having an ‘agent overhang’ of easily accessible new capabilities that will inevitably be tapped at a worse time?
82 |
83 | Future agents that take AutoGPT form, which potentially are also future AGIs, consist of an LLM and scaffolding around the LLM, that in combination are sufficiently capable.
84 |
85 | If we want to prevent future agents from reaching a critical mass of capabilities until we can ensure control over their behavior, and avoid them falling into the hands of those who would give out destructive goals, we can either target the scaffolding or we can target the LLM itself.
86 |
87 | The problem with targeting the scaffolding is that the scaffolding is a tiny Python program. Future versions will be longer and more complex with more components, but attempting to prevent their development over time seems hopeless. Whatever the LLMs we create can do, humans will find a way to do it, no matter how foolish that might be.
88 |
89 | If we are in danger, we will find little hope in either humans not creating strong scaffolding, or in convincing all humans to always choose their prompts wisely.
90 |
91 | AutoGPTs are also typically open source. Open source software, once it gets out into the world, cannot be controlled or contained, and it cannot be protected from modification. Its safety features are only as good as people’s willingness to not remove them.
92 |
93 | If we allow this ‘Linux moment’ to spread to future more powerful base models, in the style of Meta’s Llama, then no one will have the power to stop what happens next.
94 |
95 | We can mainly find hope in controlling the creation of future more powerful models, or in limiting access to those models, so that future agents lack the potential capabilities necessary to be sufficiently dangerous.
96 |
97 | To achieve this, it is helpful to know what would constitute a dangerous model. If we know approximately where the threshold is likely to be, we have better chances to avoid crossing that threshold.
98 |
99 | It is also helpful to have models now that are dangerous now, in ways that are not existential and not so lethal or expensive. This alerts more people to the dangers future models will pose, and allows a better case to be made that we need to beware the creation of such models. It also allows us to develop defenses, mitigation strategies and detection methods, either by modifying the ‘standard’ code we use to make agents (while remembering anyone can remove or disable such safety code, and some will try) or by learning how to change the internet, tech landscape or world to provide incremental protections.
100 |
101 | This is a strong argument in favor of encouraging development of more agents and agent scaffolding.
102 |
103 | There are also strong arguments pushing against this.
104 |
105 | The argument for solving the overhang implies we will inevitably find the correct scaffolding architectures. This is not obvious. Perhaps with less practical experience and tinkering, we will not do so, in which case the overhang might persist indefinitely, and we don’t want to close it. Or perhaps we want to wait until regulation is more set, so it isn’t structured around agents.
106 |
107 | Getting people into the habit of using and being excited by agents, and hooking agents up to things, could make the world more rather than less vulnerable for future agents, if we continue to do so as foolishly as we have so far. Precautions in response to mishaps might instill false confidence.
108 |
109 | The more we develop better agents, the more this pushes general AI capabilities towards dangerous levels faster, as AI will get even more attention and funding, and the agents themselves could help with development even before they are existentially dangerous on their own.
110 |
111 | On balance, I continue to favor shrinking the agent overhang, but this could easily be wrong.
112 | The Core Problem
113 | Geoffrey Hinton, the Godfather of AI, cuts to the heart of the problem.
114 |
115 | He warns, how often does a less smart thing stay under the control of a much smarter thing?
116 |
117 | Such a situation is not impossible to sustain, yet it is extremely unlikely and difficult.
118 |
119 | Humans control the world and the future because humans are the smartest entities and the most powerful optimizers on the planet. Increasingly the world’s atoms are arranged how we want them arranged, limited only by our lack of further intelligence, technology and optimization power.
120 |
121 | What happens if we hand those titles over to one or more AGIs, which would then likely rapidly be quite a lot smarter and quite a lot better at optimizing how atoms are arranged? That are assigned various goals by various people that cause them to seek power? That are better at almost all skills, including manipulating humans, than any human has ever been?
122 |
123 | The future would then rapidly belong to those AGIs. If we have not robustly solved the alignment problem by then, and figured out how to make those AGIs collectively give us a future we value, we will likely not have a place in that future, nor find much value in what comes after us, for long.
124 |
125 | GPT agents show us that ‘they aren’t agents’ is at most a temporary stumbling block towards this outcome. Finding a solution is humanity’s greatest challenge.
126 |
--------------------------------------------------------------------------------
/largeModelsOrGoodData.txt:
--------------------------------------------------------------------------------
1 | Title: Dear Sam Altman- There was never an era of making models bigger
2 | Author: Devansh- Machine Learning Made Simple
3 |
4 | Recently, the internet caught fire with a particular admission from Sam Altman- The Era of Large Language Models is over. According to this report by Wired-
5 |
6 | But the company’s CEO, Sam Altman, says further progress will not come from making models bigger. “I think we’re at the end of the era where it’s going to be these, like, giant, giant models,” he told an audience at an event held at MIT late last week. “We’ll make them better in other ways.”
7 |
8 | - Article- OpenAI’s CEO Says the Age of Giant AI Models Is Already Over
9 |
10 | This has caused a notable stir in the online space. We have seen a marked increase in the number of AI Experts (specifically GPT Experts) since November 2022. These GPT/LLM warriors have been promising AGI and 100x productivity gains from adding trillions of parameters and ever more data, and that approach might be coming to an end sooner than people realize. Looks like Alex Hormozi can live a peaceful existence now-
11 |
12 | In this article, I’m here to argue something that many people in the AI space would consider blasphemy- there never was an age of mindlessly scaling models. When we look at the data and actually compare the results, it has always been clear that throwing ever more data and parameters at the problem was doomed to fail- long before we started hitting the scaling limits that GPT-4 is now running into. By understanding how we could have foreseen this problem, we can avoid repeating these mistakes in the future- saving everyone a lot of time, money, and attention.
13 |
14 | OpenAI, the AI research company behind popular language models like ChatGPT and Dall-E 2, has reportedly doubled its losses to $540 million in 2022 due to soaring development expenses for its chatbot. The company is now looking to raise as much as $100 billion in the coming years to fund its goal of developing artificial general intelligence (AGI), an AI advanced enough to improve its own capabilities.
15 |
16 | -Source. The hype around these models is causing a major bubble. Make sure you don’t get caught up
17 |
18 | Sound like a good time? Let’s get right into it.
19 |
20 | Image Source
21 | Saturation of Benchmarks
22 |
23 | One of the most important cornerstones of the hype behind AI was the performance increases that LLMs hit with multiple benchmarks. Every week, we found that GPT/a new Language Model was able to match/beat SOTA performance on a new benchmark or task. Who can forget the hype that we hit when we learned that GPT can pass the Bar Exam and even act as a doctor?
24 |
25 | This led to a lot of speculation on AGI and how these models were developing so-called emergent abilities. However, peering behind the hood tells you a very different story.
26 |
27 | In earlier years, people were improving significantly on the past year’s state of the art or best performance. This year across the majority of the benchmarks, we saw minimal progress to the point we decided not to include some in the report. For example, the best image classification system on ImageNet in 2021 had an accuracy rate of 91%; 2022 saw only a 0.1 percentage point improvement.
28 |
29 | -Source
30 |
31 | People involved in deep learning for a while will know an uncomfortable truth- AI performance has been increasingly saturated for a few years, way before we had investors and social media influencers blindly pushing the hype behind large language models. Machine Learning Researchers have burned more and more computation for increasingly smaller gains (sometimes under a percentage point).
32 |
33 | AI continued to post state-of-the-art results, but year-over-year improvement on many benchmarks continues to be marginal. Moreover, the speed at which benchmark saturation is being reached is increasing.
34 |
35 | - Stanford AI Index Report 2023
36 |
37 | Seen from this perspective, you should be less excited about these so-called amazing architectures. When it comes to performance- sure they can hit benchmarks- but at what cost? Deploying these models at any kind of scale would have you running out of computing budgets quicker than Haaland breaking goal-scoring records. Don’t forget, the scale that makes these models powerful also makes them extremely expensive to deploy in contexts where you have to make a lot of inferences (Amazon Web Services estimates that “In deep learning applications, inference accounts for up to 90% of total operational costs”).
38 |
39 | According to researchers who wrote, Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model[3], it took roughly 50.5 tons of CO2 equivalent to train the large-language model BLOOM. GPT-3 released over 500 tons of CO2 equivalent.
40 |
41 | - Once you’re done with article, check out my article on the ethics of Copilot
42 |
43 | However, that is far from the only issue that made GenAI far less promising than the internet would have led you to believe. Let’s now cover something that would come as a surprise to a lot of people who have graduated from ChatGPT University.
44 |
45 | Performance
46 |
47 | Here’s something that would surprise you if you only read about GPT from online influencers telling you how to get ahead of 99% of people- these models are just not very good. When it comes to practically implementing Large Language Models into systems that are useful, efficient, and safe- these structures fall apart.
48 |
49 | When it comes to business use cases, these models often lose to very simple models. The authors of “A Comparison of SVM against Pre-trained Language Models (PLMs) for Text Classification Tasks” compared the performance of LLMs with a puny SVM for text classification in various specialized business contexts. They fine-tuned and used the following models-
50 |
51 | As I’ve covered here, these are the most popular models used in ML Engineering.
52 |
53 | These models were stacked against an SVM and some old-fashioned feature engineering. The results are in the following table-
54 |
55 | As you can see, SVMs match the performance of these large models. This is a huge win for them, given the much higher costs associated with the bigger models. Keep in mind, text classification is one of the core functions of these bigger models.
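For readers who want to see what such a "simple model" looks like in practice, here is a minimal sketch of a TF-IDF + linear SVM text classifier with scikit-learn. The texts and labels below are made up for illustration; they are not the paper's data.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy, made-up examples standing in for domain-specific business texts
texts = ["invoice overdue, please pay immediately",
         "great product, fast shipping",
         "refund my order, it arrived broken",
         "loving the new features, well done"]
labels = ["billing", "praise", "billing", "praise"]

# Old-fashioned feature engineering (TF-IDF n-grams) + a linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["please process my refund"]))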
56 |
57 | This extends beyond just text classification. Fast AI has an exceptional write-up investigating Github Copilot. One of their stand-out insights was- “According to OpenAI’s paper, Codex only gives the correct answer 29% of the time. And, as we’ve seen, the code it writes is generally poorly refactored and fails to take full advantage of existing solutions (even when they’re in Python’s standard library).” Just to drill home how overrated these models can be for coding, here are some of the problems with these models that the amazing Luca Rossi explored in his insightful AI & The Future of Coding 🤖. In it, he attempted to create the following in Node-
58 |
59 | Write a Telegram bot that answers my questions like ChatGPT, using OpenAI API
60 |
61 | -a relatively simple task for any competent developer
62 |
63 | The code generated in Node didn’t work. Below is Luca’s analysis of the situation- It turns out, the AI used methods that do not exist in the Node library, probably inspired by the Python ones.
64 |
65 | Below is Luca’s experience with Debugging-
66 |
67 | I replaced them and things worked fine. Two considerations:
68 |
69 | Debugging the code required me to study the openai and telegraf libraries, undoing 90% of the benefit of using the AI in the first place.
70 | Debugging was surprisingly hard, because these were not the kind of mistakes a normal person makes. When we debug, our brain is wired for looking at things we are more likely to get wrong. When you debug AI code, instead, literally anything can be wrong (maybe over time we will figure out the most common AI mistakes), which makes the work harder. In this case, AI completely made a method up — which is not something people usually do.
71 |
72 | Along with this, Luca has some great insights into how using AI Coders would cause a lot of duplication of work in bug fixing and system design, would destroy innovation, use outdated methods, and a few other concerns. Would highly recommend checking out his work here-
73 |
74 | AI & The Future of Coding 🤖
75 | Hey! Let's get something out of the way: this article has not been written by an AI. I know, it sucks. I am sorry…
76 |
77 | refactoring.fm
78 |
79 | Let’s move beyond ChatGPT and on to the alleged early signs of AGI that were discovered in Microsoft’s 155-page report- Sparks of Artificial General Intelligence: Early experiments with GPT-4. In it, we see GPT-4 failing at some relatively simple tasks, such as the following one-
80 |
81 | We see GPT-4 adding new information to the note, even though the prompt explicitly states- using exclusively the information above (and keep in mind this is Microsoft’s hype piece on GPT-4, so we don’t see the real disasters). Show this to the people that want to use GPT for doctors. If these systems are implemented recklessly, people will suffer. Regular readers know that I prefer not making sensational statements, but this is not me click-baiting you about AI’s existential threat. This is me telling you about a very real issue with these systems. If you’d like to read more about my investigation into GPT-4 and understand how the writers of the GPT-4 report ignored very real problems with their claims of AGI- read Observations on Microsoft’s Experiments with GPT-4.
82 |
83 | Data-Centric AI
84 |
85 | Prior to this ‘bombshell’ admission and GenAI capturing the attention of everyone, there was another trendy buzzword that was moving through the Data Field- ‘Data-Centric AI’. The premise was relatively simple- we had spent a lot of time building better models and not enough time on improving our data transformation processes. Take a look at this quote from Andrew Ng at an event by MIT-
86 |
87 | AI systems need both code and data, and “all that progress in algorithms means it’s actually time to spend more time on the data,” Ng said at the recent EmTech Digital conference hosted by MIT Technology Review.
88 |
89 | Once you’re done being wowed by the big names in the statement, tell me what part of that is truly surprising. It’s been well-known that data processing and curation is the most important part of the pipeline since the beginning. The reason that we saw this giant emphasis on models in research was two-fold-
90 |
91 | We have big benchmarks/datasets (ImageNet for eg) for many of the standard tasks. In this case, the problem was in the architectures themselves. By standardizing datasets, we could test the effectiveness of various changes to the training pipelines.
92 | Streetlight Effect- The streetlight effect, or the drunkard’s search principle, is a type of observational bias that occurs when people only search for something where it is easiest to look. Combine this with confirmation bias and we get the mess of mindlessly scaling model size without attention given to other factors- people simply funding/copying whatever is already being done (in our case, tweaking architectures). If all you see is work focused on changing architectures, then people will likely create work along a similar vein.
93 |
94 | But the unsustainable nature of simply scaling up architectures has been well known to anyone who even remotely understands the field. Once you step outside the bubble of ML Academia and Big Tech AI- you will see that organizations lack the expertise/budgets/inclination to invest in implementing large models because the ROIs are not worth it.
95 |
96 | We additionally do not utilize GPUs for inference in production. At our scale, outfitting each machine with one or more top-class GPUs would be prohibitively expensive, … that our models are relatively small compared to state-of-the-art models in other areas of deep learning (such as computer vision or natural language processing), we consider our approach much more economical.
97 |
98 | -This is from a writeup by the engineers at Zemanta, a leading advertising firm. Even at their level, SOTA DL isn’t really worth it. To learn more read- How to handle 300 million predictions per second over here
99 |
100 | This is why we have seen the move away from bigger models and obscenely big data sets, and a focus instead on more intelligent design. Intelligent design was explicitly mentioned as one of the reasons open source was beating big companies like Google and OpenAI in the infamous ‘Google has no Moat’ document-
101 |
102 | Open-Source companies are doing things in weeks with $100 and 13B params that we struggle with at $10M and 540B.
103 |
104 | -Read my analysis of the situation here
105 |
106 | Closing
107 |
108 | Combining this together, here is a question I want to ask you- when exactly did we have an era of scaling up? At what point was it ever a good idea to implement more and more scaling- as opposed to focusing on better data selection, using ensembles/mixture of experts to reduce errors, or constraining the system to handle certain kinds of problems to avoid errors? People have been calling this unsustainable for a long time, way before general-purpose LLMs were a mainstay in ML.
109 |
110 | Rather than an insightful statement about the future of AI, Sam Altman’s statement is a face-saving maneuver. As is becoming clear- the market is becoming increasingly saturated and OpenAI has no clear direction for making money. Even if they did corner the market, they can’t raise prices because customers can always switch to open-source models (and this assumes that the OpenAI models are meaningfully better- which is debatable). The bubble is bursting, the chickens have come home to roost, and with this statement Sam Altman is scrambling to maintain the facade that these problems are under control.
111 |
112 | That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.
113 |
--------------------------------------------------------------------------------
/Optimization/understanding-automatic-text-summarization-2.txt:
--------------------------------------------------------------------------------
1 | Title: Understanding Automatic Text Summarization-2: Abstractive Methods | by Abhijit Roy | Towards Data Science
2 | --------------------------------------------
3 | source: https://towardsdatascience.com/understanding-automatic-text-summarization-2-abstractive-methods-7099fa8656fe
4 |
5 | Understanding Automatic Text Summarization-2: Abstractive Methods
6 | How can we use deep learning to summarize the text?
7 |
8 | author is Abhijit Roy
9 |
10 |
11 | This is my second article on text summarization. In my first article, I talked about extractive approaches to summarizing text and the metrics used. In this article, we are going to talk about abstractive summarization. We are going to see how deep learning can be used to summarize text. So, let’s dive in.
12 |
13 | Abstractive Summarizers
14 |
15 | Abstractive summarizers are so called because they do not select sentences from the original text passage to create the summary. Instead, they produce a paraphrase of the main contents of the given text, using a vocabulary set different from the original document. This is very similar to what we as humans do to summarize: we create a semantic representation of the document in our brains, then pick words from our general vocabulary (the words we commonly use) that fit the semantics, to create a short summary that covers all the points of the actual document. As you may notice, developing this kind of summarizer is difficult, as it requires Natural Language Generation. Let’s look at the most widely used approach to the problem.
16 |
17 | Application of sequence-to-sequence RNNs
18 |
19 | The approach was proposed in a paper by Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, Bing Xiang from IBM. The term “sequence to sequence models” is used because the models are designed to create an output sequence of words from an input sequence of words. The input sequence in the considered case is the actual text document and the output sequence is the shortened summary.
20 |
21 | The paper proposes a model inspired by the attentional Recurrent Neural Network encoder-decoder model that was first proposed for machine translation by Dzmitry Bahdanau, Jacobs University, Germany.
22 |
23 | The problems are quite different, though, as you can already sense. Firstly, for machine translation we need the translation to be lossless, as we want the exact sentence in translated form; for summary generation we need to compress the original document to create the summary, so it needs to be a bit lossy. Secondly, for summary generation, the length of the summary does not depend on the length of the original text. These two points are the key challenges of the problem.
24 |
25 | Before jumping into the application details of the paper let’s look at the encoder and decoder networks and the reason for using the attention layer.
26 |
27 | Encoder and Decoder Networks
28 |
29 | If we consider a general LSTM (Long Short-Term Memory) layer, it looks something like the diagram given below. It either produces an output for every input or it creates a feature vector, which is later used by dense neural network layers, with a softmax applied, for classification tasks. An example is sentiment detection, where we pass the whole sentence through the RNN and feed the resulting feature vectors to softmax layers to produce the final result.
30 |
31 | But one thing to realize here is that this particular model cannot be applied to the current problem, or to problems like machine translation, i.e., sequence-to-sequence problems more generally. The main reason is that the size of the output is independent of the size of the input, and both are sequences. To deal with this, the encoder-decoder network model was introduced.
32 |
33 | The model basic architecture is described by the above diagram.
34 |
35 | The encoder is responsible for taking in the input sentence or original document and generating a final state vector (hidden state and cell state). This is represented by the internal state in the diagram. The encoder may contain LSTM, plain RNN, or GRU layers. Mostly LSTM layers are used because they mitigate the exploding and vanishing gradient problems.
36 |
37 | The above diagram shows our encoder network. In the encoder network, one word is fed at a time step, and finally, after the nth input word is fed to the LSTM layer the hidden state and cell states become our final state or feature vector. The cell state Cn and the hidden state Hn are sent to the first set of the LSTM layer of the decoder.
38 |
39 | This is how our decoder model looks. The first layer receives the encoder’s final states, i.e., the hidden and cell state activations. The decoder takes in these inputs and generates the predicted words of the output sequence, given the previously generated word. So, at time step 1 the decoder’s LSTM receives a zero vector as input and Y1 is the predicted word; at time step 2, Y1 is fed to the LSTM layer as input and Y2 is generated, and so on. The decoder generates words time step by time step until the end-of-sequence tag is produced.
40 |
41 | This might raise a question: how are the words generated? Well, here is the answer. The encoder-decoder model is trained on a target set or vocabulary of words. At each step of the decoder, the LSTM’s hidden activation is sent through a softmax layer, which generates a probability for each word in the vocabulary to be predicted as the next word. The word with the maximum probability is chosen as the output at that time step. Now, how does the model know which word exactly suits the semantics? For this, the model trains on a dataset, turning the problem into a supervised classification problem. Alongside this, models usually use word embeddings of the words in the vocabulary from well-known embedding sets like word2vec by Google or GloVe by Stanford NLP. The word embeddings help to capture several properties of a word, such as whether it is similar to another given word. Sometimes TF-IDF vectorization is also used to capture how meaningful the context words are.
42 |
43 | Let’s go by example. Say we have a dataset with a collection of long-form reports and their human-written summaries. These can be used as inputs and targets for training our encoder-decoder network. We will vectorize the inputs and targets and form a vocabulary. Next, we will pick up the word2vec or GloVe embeddings for the words in our vocabulary and then fit the inputs and targets to our model for training.
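To make the training setup a bit more tangible, here is a minimal Keras sketch of a plain LSTM encoder-decoder trained with teacher forcing. All sizes are placeholders, and this is a simplification for illustration rather than the architecture from the paper, which uses GRUs plus the attention, feature-rich encoder, and pointer mechanisms described below.

from tensorflow.keras import layers, Model

vocab_size, emb_dim, units = 10000, 128, 256   # placeholder sizes

# Encoder: reads the source document and returns its final hidden/cell states
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(vocab_size, emb_dim)(enc_in)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: generates the summary word by word, initialized with the encoder states
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(vocab_size, emb_dim)(dec_in)
dec_out, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(vocab_size, activation="softmax")(dec_out)   # next-word probabilities

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([doc_ids, summary_ids_shifted_in], summary_ids_out, ...)   # teacher forcing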
44 |
45 | But this particular summarization problem has an issue. The original document can be very large. Say the document has 100 lines. Now, when we, as humans, are summarizing lines 1–5, do we need to consider the 100th line? Not really. When we summarize manually, we give more attention to lines 1–5 while summarizing those lines, and slowly move forward. This behavior could not be implemented by the plain encoder-decoder model, so the attention mechanism was introduced.
46 |
47 | Attention Mechanism
48 |
49 | The mechanism aims to focus on specific parts of the input, rather than the entire input sequence, when predicting a word. This is very similar to the human approach and seems to solve the problem.
50 |
51 | Source
52 |
53 | The above diagram shows the implementation of the attention layer by TensorFlow. We can see that, to predict a word, each word in the input sequence is assigned a weight, called an attention weight. The attention-weighted sum of the encoder activations forms the context vector that is used for prediction.
54 |
55 | Let us examine this in detail. The paper proposes an encoder composed of a bi-directional GRU RNN and a decoder with a uni-directional GRU RNN. Both the encoder and decoder have the same number of hidden layers and units.
56 |
57 | Let’s see the attention mechanism here.
58 |
59 | The upper green layer shows the decoder; Y1 and Y2 are the time-step outputs. X1, X2 are the inputs to the encoder. “af0” is the forward activation of input 0, “ab0” is the backward activation of time step 0, and so on.
60 |
61 | Let’s say a_n = (ab_n, af_n) is the combined activation at time step n, so a_0 = (af_0, ab_0), i.e., the forward and backward activations combined. W_{mn} determines how much weight should be given to the input at time step m while predicting the decoder output at time step n.
62 |
63 | The context vector is the attention-weighted sum of the encoder activations. It is given by:
64 |
65 | C_n = Σ_m W_{mn} · a_m, where the sum runs over all input time steps m.
66 |
67 | The above equation states that the context vector for decoder time step n is the sum, over all input time steps m, of the weight given to the input at time step m multiplied by the combined activation of time step m; if there are x words in the input, the sum runs over x time steps.
68 |
69 | Now, how are the attention weights obtained?
70 |
71 | W_{mn} = softmax_m(e_{mn}), i.e., the scores e are normalized with a softmax over all input time steps m.
72 |
73 | We obtain e using a small neural network with a single hidden layer. The network takes in S_{n-1} (the previous decoder state) and a_m (the combined activation of time step m) and outputs the score e_{mn}. This scoring network is trained jointly with the encoder-decoder model.
74 |
75 | Because of the softmax, the weights assigned to all the input time steps always sum to 1.
76 |
77 | This is the overall mechanism of the attention model.
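As a small, self-contained illustration of those formulas (my own NumPy sketch with made-up dimensions, not the paper's code), computing the context vector for a single decoder time step looks like this:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x_steps, h = 5, 8                      # made-up sizes: 5 input time steps, hidden size 8
a = np.random.randn(x_steps, h)        # combined (forward + backward) encoder activations a_m
s_prev = np.random.randn(h)            # previous decoder state S_{n-1}

# Scoring network with a single hidden layer: e_m = v . tanh(W_s s_prev + W_a a_m)
W_s, W_a, v = np.random.randn(h, h), np.random.randn(h, h), np.random.randn(h)
e = np.array([v @ np.tanh(W_s @ s_prev + W_a @ a_m) for a_m in a])

W_attn = softmax(e)                    # attention weights, summing to 1
context = W_attn @ a                   # C_n = sum over m of W_mn * a_m
print(W_attn, context)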
78 |
79 | Large Vocabulary Trick
80 |
81 | We discussed previously that the words predicted by the decoder are generated using a softmax. Now, if we use all the words in our vocabulary as our target word set, the softmax will have a huge number of output nodes, and prediction will be computationally inefficient. To solve this, the paper adopts the Large Vocabulary Trick proposed by Sebastien Jean, University of Montreal. The model is trained in mini-batches, and the target word set, or decoder vocabulary, of each mini-batch is restricted to the words of the source documents in that particular mini-batch. If we use a different subset of the vocabulary for each mini-batch, the target vocabularies will naturally be unequal in size, since different samples have different numbers of lines and words. So, we add the most frequent words of the total vocabulary to the subset vocabularies to make them a fixed size. This decreases the time requirements and speeds up convergence, since shrinking the target set also shrinks the softmax layer.
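A toy illustration of the trick (my own sketch, not the paper's implementation): the decoder vocabulary for a mini-batch starts from the words of that batch's source documents and is padded with globally frequent words up to a fixed size.

from collections import Counter

def batch_decoder_vocab(batch_docs, global_counts, target_size):
    # Words that actually occur in this mini-batch's source documents
    vocab = {w for doc in batch_docs for w in doc.split()}
    # Pad with the most frequent words overall until the vocabulary reaches a fixed size
    for word, _ in global_counts.most_common():
        if len(vocab) >= target_size:
            break
        vocab.add(word)
    return vocab

global_counts = Counter("the a of to and summary model text the the a and".split())
print(batch_decoder_vocab(["the model reads text", "a short summary"], global_counts, 10))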
82 |
83 | Extraction of keywords
84 |
85 | Previously, when we discussed the encoder-decoder model, I mentioned that we normally use word embeddings to represent the words in the documents after vectorization. Now, let’s think about what we actually need the words to convey in order to summarize. We will realize that the embeddings are not enough, because for summarization we need to focus on the context and the keywords in the piece of text. The embeddings help to get a general idea about a word, but they say nothing about the context of the text. So, the paper proposes to take into consideration features like part-of-speech tags, named-entity tags, and TF-IDF statistics of a word, alongside its embedding, to represent the word. We convert the continuous TF-IDF values into categorical values using bins. Finally, we take all of these features and the embedding for a word and create a new, richer representation. So, basically, the TF-IDF and POS tags give us an idea of how important the word is in the context of the document, and the word embedding gives a general idea about the word. We concatenate them into a single long vector and feed it to the network. One thing to notice is that we use only word embeddings to represent the words on the target side.
86 |
87 | Source
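As a rough illustration of that concatenation (the feature sizes here are arbitrary assumptions, not the paper's):

import numpy as np

emb = np.random.randn(100)                          # word2vec/GloVe-style word embedding
pos = np.zeros(17);  pos[3] = 1.0                   # part-of-speech tag, one-hot
ner = np.zeros(5);   ner[0] = 1.0                   # named-entity tag, one-hot
tfidf_bin = np.zeros(10); tfidf_bin[7] = 1.0        # TF-IDF value discretized into bins

word_repr = np.concatenate([emb, pos, ner, tfidf_bin])   # fed to the encoder
# The decoder (target) side keeps plain word embeddings only.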
88 |
89 | Next, We will look at another very important aspect proposed by the paper.
90 |
91 | Switching Generator-Pointer
92 |
93 | In natural language processing, when we train a model in a supervised way, we often need to handle words which are not present in our vocabulary. Such words are termed OOV, or out-of-vocabulary. In normal cases, we handle them using ‘UNK’ tags. But in the case of summarization this will not do, because those words may carry real significance in the summary. The paper proposes a switching generator-pointer to handle such cases. At each time step of the decoder, a pointer into the input text is maintained. Whenever the decoder faces an OOV term, it points to a term in the input and uses that term from the input text directly. So, the decoder has basically two actions at a time step: it can generate a word from the target dictionary, or it can point to and copy a word from the input. This decision is taken using a switch: if the switch is turned on, the decoder generates a word; otherwise, it copies a word from the input.
94 |
95 | Source
96 |
97 | Now, the question is: how does the switch operate? The switch is a sigmoid activation function over the entire context at a particular decoder time step. It is given by:
98 |
99 | Source
100 |
101 | “where P(si = 1) is the probability of the switch turning on at i th time-step of the decoder, hi is the hidden state, E[oi−1] is the embedding vector of the emission from the previous time step, ci is the attention-weighted context vector, and Ws h, Ws e, Ws c, bs and vs are the switch parameters”-Source
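Reading the symbols off that quote, the switch probability has roughly this form (reconstructed here from the description above, so treat the exact arrangement as an assumption rather than a verbatim copy of the paper's equation):

P(s_i = 1) = sigmoid( v_s · (W_s^h · h_i + W_s^e · E[o_{i-1}] + W_s^c · c_i + b_s) )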
102 |
103 | The pointer at each time step must point to a word in order to copy it; this is decided based on the attention weight distribution for that decoder time step.
104 |
105 | Source
106 |
107 | “In the above equation, pi is the pointer value at i th word-position in the summary, sampled from the attention distribution Pa i over the document word-positions j ∈ {1, . . . , Nd}, where P a i (j) is the probability of i th time-step in the decoder pointing to the j th position in the document, and h d j is the encoder’s hidden state at position j”-Source
108 |
109 | The switch is also trained during the training of the neural network. The function given below is optimized during the training.
110 |
111 | Source
112 |
113 | “where y and x are the summary and document words respectively, gi is an indicator function”-Source. The indicator function is set to 0 whenever a word is OOV with respect to the decoder vocabulary. This turns off the switch and a word is copied from the input text.
114 |
115 | This is an overview of the sequence-to-sequence model for text summarization. I encourage you to go through the references for more implementational details.
116 |
117 | One thing to note is that the authors proposed hierarchical attention layers for long documents. If the documents are very long, we sometimes need to identify the key sentences as well as the keywords. For this, we need a hierarchical attention mechanism: one level takes care of the importance of sentences and another level of the importance of words. The two attention layers operate at the two levels simultaneously.
118 |
119 | Source
120 |
121 | The other two most notable approaches are:
122 |
123 | Facebook’s model
124 |
125 | This approach was proposed by Alexander Rush, Facebook AI Research in 2015.
126 |
127 | Source
128 |
129 | The above diagram describes Facebook’s model. It has three encoders:
130 |
131 | A bag-of-words encoder: It just uses a bag-of-words representation of the input sentence and ignores relationships with neighboring words. The decoder takes in the encoded vector and predicts one word per time step.
132 | Convolutional encoder: Convolutional layers are used to generate feature vectors from the word embeddings of the input, and a decoder then produces the words time step by time step.
133 | Attention-based encoder: This encoder works with the attention-based RNN approach we discussed above.
134 |
135 | Lastly, a beam search is applied to the results in order to obtain the summarized text.
136 |
137 | Pointer-generator model by Google
138 |
139 | This model was proposed by Abigail See from Stanford University.
140 |
141 | Source
142 |
143 | I feel this model is similar to IBM’s model, but it uses a coverage mechanism to reduce the repetition problem of sequence-to-sequence networks.
144 |
145 | You can go through the papers for more details. I will provide the links in the reference.
146 |
147 | Conclusion
148 |
149 | In this article, we talked about several approaches by which deep learning is used for text summarization.
150 |
--------------------------------------------------------------------------------
/semanticsearcha_praticaloverview.txt:
--------------------------------------------------------------------------------
1 | Title: Semantic search: a practical overview | ML6team
2 | ------------------------------------------------------
3 |
4 | Semantic search: a practical overview
5 |
6 | Author is Mathias Leys
7 |
8 |
9 | Introduction
10 |
11 | One of the main challenges in leveraging the value of textual data lies in the unstructured nature of natural language. This data format makes sense to us humans but is very tricky to process in an automated way. However, despite this challenge, an estimated 80–90% of the data in an enterprise is text [1].
12 |
13 | The potential added value that leveraging this data source could have, should not be underestimated. For example, lawyers on average spend 11.2 hours per week on tasks related to information retrieval [2]. One potential way to make use of this data is by means of an effective search system.
14 |
15 | - Imagine if you could effectively search through millions of pdf files full of legal text to find that one law that applies to your situation.
16 | - Imagine if you could instantly browse through massive amounts of technical manuals to find the paragraph that exactly explains how to solve the problem you’re facing.
17 |
18 | Sounds useful doesn’t it?
19 |
20 | This is where Information Retrieval, and in particular Semantic Search, comes in: this is exactly the branch of NLP (Natural Language Processing) that aims to tackle this problem. In this blogpost, we will investigate the topic from a very practical viewpoint. So if you don’t fancy yourself a mathematician or an expert programmer, don’t click away just yet.
21 |
22 | The goal here is to get an intuition for what semantic search is and which challenges you will likely face when implementing a semantic search engine in a practical setting. At the end you should have a solid view on (1) what semantic search is, (2) when to use semantic search and (3) how to approach implementing a semantic search engine.
23 |
24 | We will not cover the technical aspects in-depth here but we will post another blogpost that deals with the same topics but from a more technical standpoint. The focus here will be much more on the “Why” than on the “How”.
25 |
26 | What is Semantic Search?
27 |
28 | Let’s not get too ahead of ourselves just yet and briefly discuss what semantic search is exactly and how it works. Essentially, semantic search attempts to match queries to relevant documents based on meaning rather than on pure syntax. In practice, the main advantages of such an approach are twofold:
29 |
30 | (1) Exact wording becomes less relevant 📝
31 | Imagine you work for a real-estate company and have a database full of articles about properties. Say you are looking for the article that is about a specific house you have in mind so you form a query that contains the information you remember about this house. You would likely end up with a query that looks something like this: “a two bedroom house in Los Angeles”. Let’s now say that the article you were looking for refers to this property as “a residence with 2 rooms in sunny California”.
32 |
33 | Well, if you were using a non-semantic search engine, you wouldn’t find this result simply because your query shares no common words with the result you were after. An article about a property that is described as “a two story house in Los Angeles” would, for example, be deemed much more relevant since this description shares many common words with your original query even though there is an important difference in meaning between “a two bedroom house” and “a two story house”.
34 |
35 | Exact wording matters a lot in a lexical search
36 |
37 | However, since a semantic search engine deals with meaning rather than syntax, it would for instance recognize that “residence” and “house”, “Los Angeles” and “California”, etc. are closely related and you would likely find what you were looking for. It would, thus, know that “a two bedroom house in Los Angeles” is closer in meaning to “a residence with two rooms in sunny California” than to “a two story house in Los Angeles”.
38 |
39 | Exact wording becomes less important in a semantic search
40 |
41 | There is often a difference between the wording used in your query and the wording used in the results you are after. Semantic search engines can deal with this whereas classical non-semantic search engines cannot.
42 |
43 | (2) Context is taken into account 📖
44 | People will often use search engines to look for information based on a description of a concept rather than on a set of literal keywords. For instance, when you search for “Python”, a document called “learn 5 fun facts about Python snakes” and a document called “Python programming tutorial” are both potentially relevant results since it is not clear which type of Python you are referring to.
45 |
46 | However, when you search “learn Python”, a semantic search engine will recognize that the word “Python” in this context is more likely to refer to the programming language than to the type of snake. Therefore it will deem the programming tutorial more relevant. Don’t trust me? Try it out for yourself: search for “Python” in Google images and for “learn Python” afterwards and watch the snakes disappear.
47 |
48 | Context is taken into account in a semantic search
49 |
50 | A classical non-semantic search engine, however, will deem “learn 5 fun facts about Python snakes” more relevant because this title shares more common words with your query as it contains both “learn” and “Python”.
51 |
52 | Context is not taken into account in a lexical search, only exact wording
53 | How does Semantic Search work?
54 |
55 | Now you may be asking yourself how this all works. 🤔
56 |
57 | In a nutshell, semantic search works by comparing sentence embeddings: you are looking for a result with a sentence embedding that is close to the sentence embedding of your query. Did that make sense to you? Perfect, feel free to skip this part. If it didn’t, don’t worry, I will give you a quick NLP crash course.
58 |
59 | What are (sentence) embeddings?
60 |
61 | Embeddings play a central role in nearly any NLP application. Basically an embedding is a vector of numbers that represents the meaning of a word. Concretely, this entails that words with similar meanings have similar embeddings and vice versa. The graph below visualizes (3-dimensional) embeddings. You can clearly see that words with similar meanings have similar embeddings. For instance, we notice that words related to school have low X-values and low Z-values, words related to sports have high X-values and high Z-values, etc.
62 |
63 | Visual representation of word embeddings
64 |
65 | How does it arrive at these embeddings you ask? Well, nowadays via transformer-based language models [3]. You can usually recognize these models because their name will be some pun on the name BERT. NLP researchers are language enthusiasts after all, so silly puns are inevitable.
66 |
67 | Now, you may be thinking “Isn’t transformers that movie about car robots?”. Sorry to disappoint but no. Well technically yes it is but unfortunately Shia LaBeouf doesn’t really do NLP (yet). Very oversimplified, transformers are language models that will process massive amounts of text and try to figure out which words often appear in the same context (usually via a technique called masked-language modeling [4]) and place the embeddings for these words close together as they can be used (somewhat) interchangeably.
68 |
69 | By now, you should be familiar (on a high-level) with what word embeddings are. Sentence embeddings are the completely analogous counterpart of word embeddings with the major difference being that they produce one embedding for a sequence of words rather than an embedding for each word individually. So texts with similar meanings have sentence embeddings that are close together.
70 |
71 | Now that you are also familiar with sentence embeddings, let’s take another look at the statement made at the beginning of this paragraph: “semantic search works by comparing sentence embeddings: you are looking for a result with a sentence embedding that is close to the sentence embedding of your query”. Does this make more sense now? Perfect. (I will assume you answered yes 🤞).
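If you want to try this yourself, here is a minimal sketch using the open-source sentence-transformers library; the model name is just a commonly used general-purpose example, not a recommendation for your specific domain.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example general-purpose English model

query = "a two bedroom house in Los Angeles"
docs = ["a residence with 2 rooms in sunny California",
        "a two story house in Los Angeles"]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

print(util.cos_sim(query_emb, doc_embs))   # cosine similarity between sentence embeddings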
72 |
73 | How do you implement semantic search?
74 |
75 | As you will soon discover, semantic search is a broad and very nuanced topic. Therefore, in order to structure our thoughts, we will follow the flowchart below as a guide throughout our journey into the semantic search wilderness.
76 |
77 | Semantic search flowchart
78 |
79 | It may not look very informative right now but it will make sense soon, I promise.
80 |
81 | Is semantic search for you?
82 | Semantic search flowchart part 1
83 |
84 | Before you get straight to implementing, let's stand still for a moment and consider whether semantic search is even appropriate for your use case. As mentioned before, the main pros of semantic search are its ability to deal with synonyms and with context. Therefore, it is important to consider critically whether either of these advantages is actually relevant to your use case.
85 |
86 | Say for example that you have a database of medical documents and will use it to look for documents containing specific names of diseases or specific codenames. In this case, synonyms aren’t of much use to you (a disease only has one name) and if you are only searching for the name of the disease, there isn’t much context to take into account either.
87 |
88 | Semantic search is not always an added-value
89 |
90 | Therefore, the added value that semantic search will bring is likely quite limited, so you'll probably be better off going for a lexical search engine. TF-IDF [5] or especially BM25 [6] are (at least at the time of writing) solid lexical search algorithms to consider. I won't go into how these algorithms work but if you're interested, check out our upcoming more technical semantic search blogpost.
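
As a rough illustration of what such a lexical approach boils down to, here is a tiny BM25 sketch. I'm assuming the rank_bm25 package here purely for convenience; any BM25 implementation would do.

from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "learn 5 fun facts about Python snakes",
    "Python programming tutorial",
    "two-bedroom villa in LA with outdoor pool",
]
# BM25 operates on tokens, so we use a naive lowercase whitespace tokenizer
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "learn python".split()
print(bm25.get_scores(query))              # one lexical relevance score per document
print(bm25.get_top_n(query, corpus, n=1))  # the snake facts win: they share more exact query words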
91 |
92 | In conclusion, there are two important considerations here:
93 | (1) Will your users search for concepts? For instance, will your users search for “house” or “two-bedroom villa in LA with outdoor pool”? Taking the context into account really only pays dividends for the latter.
94 | (2) Do you have a diverse dataset? For instance, do you want to search through the employee contracts of the finance department only, or through all documents in your company? In a diverse dataset, the same concepts get worded in many different ways, and thus semantic search pays off more.
95 |
96 | Do you need domain adaptation?
97 | Semantic search flowchart part 2
98 |
99 | Now that you have decided that semantic search is suited to your specific use case, you will need a sentence-transformer to get you started. Nowadays, there are many open-source options to choose from. Check them out here. If the domain of your data (e.g., legal text, technical manuals, news, etc.) is quite generic, such as English news or Spanish articles, you will likely find a sentence-transformer suited not only to your language but also to your domain.
100 |
101 | However, maybe you consider yourself an Icelandic medical document aficionado. Perhaps you enjoy Slovenian legal texts. Honestly, who can blame you? Don’t we all?
102 |
103 | This could be you!
104 |
105 | Unfortunately, for these specific situations, you’ll probably have a difficult time finding a language model specifically suited to that domain. Language models trained on generic text in that language are likely the best option you’ll find.
106 |
107 | However, I think you can imagine that “everyday” text looks very different to medical text. A general-purpose Icelandic language model probably rarely encountered the medical jargon words (e.g., disease names, anatomical words, etc.) that play an important role in medical documents. Therefore, a general-purpose language model will likely have a hard time accurately figuring out the meaning of medical texts.
108 |
109 | In such a scenario, you would want to use a technique called “domain adaptation”. Essentially, this entails that you take your general-purpose language model that was trained on general text and continue training it on data from your domain of interest. Again, I won’t get into the nitty gritty of how this all works: check out our upcoming more technical blogpost for that.
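
To give an idea of what domain adaptation looks like in practice, below is a hedged sketch of continued masked-language-model training with the Hugging Face transformers library. The checkpoint name, the corpus file and the hyperparameters are placeholders for illustration, not recommendations.

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "your-generic-icelandic-bert"  # placeholder: any general-purpose MLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Plain-text file with in-domain (e.g. medical) passages, one per line
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

# Randomly mask 15% of the tokens so the model keeps learning to fill in domain jargon
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-model",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # continued pre-training on the domain corpus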
110 |
111 | Texts from other domains can be wildly different
112 |
113 | However, you should see this process as follows. Say you don’t speak a word of Slovenian. Training a language model would then entail giving you massive amounts of everyday Slovenian text to study. At the end, you’ll probably be pretty familiar with Slovenian. However, if I were to give you Slovenian legal text and ask you some questions about it, you probably still wouldn’t be able to answer them. Well, domain adaptation would entail sending you to Slovenian law school and making you study Slovenian legal texts. Although this seems like a pretty unpleasant experience (my apologies to any Slovenian lawyers reading this), you would be able to grasp the meaning of Slovenian legal texts afterwards.
114 |
115 | How do you implement a semantic search engine?
116 | Semantic search flowchart part 3
117 |
118 | At this point, you are quite close to a functioning semantic search system. You have already made the decision to opt for semantic search over the alternatives and you have a suitable language model that will underlie your solution. Now, let's focus on a drawback of semantic search: it can be quite slow and compute-intensive. If computational resources and latency aren't issues for you, then you can simply accept this drawback and you have yourself a search engine (congrats 🎊).
119 |
120 | In most (practical) cases, however, these are significant issues. Don’t worry though, there is a solution! That solution is called “Retrieve & Rerank”. Essentially, the idea here is to use a (fast and inexpensive) lexical search algorithm (such as TF-IDF or BM25 as discussed earlier) to filter out the irrelevant documents in your database so that you arrive at a smaller subset of “candidate documents” that could all potentially be relevant. This is the “retrieve” step. Then you use your semantic search engine to search only within these “candidate documents” rather than within all documents in your database. This is the “rerank” step. Graphically, the flow will look something like the graph below.
121 |
122 | Retrieve & Rerank hybrid architecture
123 |
124 | Let’s illustrate with a very practical example. You have a database containing 100,000 documents. Let’s say a lexical search within 1,000 documents takes around 1 ms (= 0.001 seconds) whereas a semantic search within 1,000 documents takes around 50 ms (= 0.05 seconds). A full semantic search within all 100,000 documents would thus take (0.05 * 100 =) 5 seconds.
125 |
126 | Now, let’s say we first do a lexical search within the 100,000 documents and only keep the top 5,000 documents. This takes (0.001 * 100 =) 0.1 seconds. We can then perform a semantic search within only these 5,000 documents. This takes (0.05 * 5 =) 0.25 seconds. We will likely end up with largely similar results in both situations, although the full semantic search approach took 5 seconds whereas the retrieve & rerank approach only took 0.35 seconds and used less computational resources.
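
Putting the earlier snippets together, a bare-bones retrieve & rerank flow could look like the sketch below. The library choices are assumptions, and in a real system you would of course build the BM25 index and load the model once rather than per query.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

def retrieve_and_rerank(query, documents, retrieve_k=5000, final_k=10):
    # Retrieve: cheap lexical filter over the full collection
    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    lexical_scores = bm25.get_scores(query.lower().split())
    order = sorted(range(len(documents)), key=lambda i: lexical_scores[i], reverse=True)
    candidates = [documents[i] for i in order[:retrieve_k]]

    # Rerank: expensive semantic comparison, but only within the candidate subset
    doc_embeddings = model.encode(candidates, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    semantic_scores = util.cos_sim(query_embedding, doc_embeddings)[0].tolist()

    reranked = sorted(zip(candidates, semantic_scores), key=lambda pair: pair[1], reverse=True)
    return reranked[:final_k]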
127 |
128 | Conclusion
129 |
130 | If you reached this point in the post, you should have a solid (high-level) understanding of semantic search along with the main practical considerations that come into play when you try to implement a semantic search system in a real-life scenario.
131 |
132 | I have already made a few shameless plugs but if you are interested in the technical aspects related to semantic search, feel free to check out our upcoming more technical blogpost on the topic.
133 |
134 | If you want to read up on some real-life semantic search use cases, I can highly recommend taking a look at the case studies on our work for Fednot (Belgian federation of notaries) and Funke (German media group).
135 |
136 | I hope this blogpost was what you were searching for! (pun very much intended)
137 |
138 | References
139 |
140 | [1]: Bill Inmon. (June 28 2016). Why Do We Call Text “Unstructured”?
141 | https://tdwi.org/articles/2016/06/28/text-unstructured.aspx
142 |
143 | [2]: IDC’s Information Worker Survey. (June 2012).
144 | https://metajure.com/lawyers-waste-six-hours-a-week-on-document-management-issues-2/
145 |
146 | [3]: Devlin, et al. (24 May 2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
147 | https://arxiv.org/pdf/1810.04805.pdf
148 |
149 | [4]: James Briggs. (19 May 2021). Masked-Language Modeling With BERT
150 | https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c
151 |
152 | [5]: Anirudha Simha. (October 6 2021). Understanding TF-IDF for Machine Learning
153 | https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/
154 |
155 | [6]: Vinh Nguyên. (1 September 2019). Okapi BM25 with Game of Thrones
156 | https://blog.mimacom.com/bm25-got/
157 |
164 |
--------------------------------------------------------------------------------
/Optimization/understanding-automatic-text-summarization-1.txt:
--------------------------------------------------------------------------------
1 | Title: Understanding Automatic Text Summarization-1: Extractive Methods | by Abhijit Roy | Towards Data Science
2 | ----------------------------------------------------
3 | source: https://towardsdatascience.com/understanding-automatic-text-summarization-1-extractive-methods-8eb512b21ecc#
4 |
5 | Understanding Automatic Text Summarization-1: Extractive Methods
6 | How can we summarize our documents automatically?
7 |
8 | author is Abhijit Roy
9 |
10 |
11 | Text summarization is commonly used by several websites and applications to create news feeds and article summaries. It has become very essential for us due to our busy schedules. We prefer short summaries with all the important points over reading a whole report and summarizing it ourselves. So, several attempts have been made to automate the summarization process. In this article, we will talk about some of them and see how they work.
12 |
13 | What is summarization?
14 |
15 | Summarization is a technique to shorten long texts such that the summary has all the important points of the actual document.
16 |
17 | There are mainly four types of summaries:
18 |
19 | Single Document Summary: Summary of a Single Document
20 | Multi-Document Summary: Summary from multiple documents
21 | Query Focused Summary: Summary of a specific query
22 | Informative Summary: It includes a summary of the full information.
23 | Approaches to Automatic summarization
24 |
25 | There are mainly two types of summarization:
26 |
27 | Extraction-based Summarization: The extractive approach involves picking up the most important phrases and lines from the documents. It then combines all the important lines to create the summary. So, in this case, every line and word of the summary actually belongs to the original document which is summarized.
28 |
29 | Abstraction-based Summarization: The abstractive approach involves summarization based on deep learning. So, it uses new phrases and terms, different from the actual document, keeping the points the same, just like how we actually summarize. So, it is much harder than the extractive approach.
30 |
31 | It has been observed that extractive summaries sometimes work better than the abstractive ones probably because extractive ones don’t require natural language generations and semantic representations.
32 |
33 | Evaluation methods
34 |
35 | There are two types of evaluations:
36 |
37 | Human Evaluation
38 | Automatic Evaluation
39 |
40 | Human Evaluation: Scores are assigned by human experts based on how well the summary covers the points, answers the queries, and other factors like grammaticality and non-redundancy.
41 |
42 | Automatic Evaluation
43 |
44 | ROUGE: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is the method that determines the quality of the summary by comparing it to other summaries made by humans as a reference. To evaluate the model, there are a number of references created by humans and the generated candidate summary by machine. The intuition behind this is if a model creates a good summary, then it must have common overlapping portions with the human references. It was proposed by Chin-Yew Lin, University of California.
45 |
46 | Common versions of ROUGE are:
47 |
48 | ROUGE-n: It is a measure based on the comparison of the machine-generated output and the reference output in terms of n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech, i.e., it is simply a sequence of words: bigrams mean two words, trigrams mean three words, and so on. We normally use bigrams.
49 |
50 | ROUGE-n = p / q
51 |
52 | where p is “the number of common n-grams between candidate and reference summary”, and q is “the number of n-grams extracted from the reference summary only”.
53 |
54 | ROUGE-L: It states that the longer the longest common subsequence between two texts, the more similar they are. So, it is more flexible than n-gram matching. It assigns scores based on how long a sequence can be that is common to the machine-generated candidate and the human reference.
55 |
56 | ROUGE-SU: It brings in the concept of skip-bigrams and unigrams. Basically, it considers a bigram even if there are some other words between the two words, i.e., the bigrams don’t need to be consecutive.
57 |
58 | ROUGE-2 is the most popular and is given by:
59 | ROUGE-2 = Σ_S Σ_{bigram i ∈ S} min(count(i, X), count(i, S)) / Σ_S Σ_{bigram i ∈ S} count(i, S)
60 | where for every bigram i we take the minimum of the number of times it occurs in the generated document X and in the reference document S, summed over all reference documents given, divided by the total number of times each bigram appears in all of the reference documents. It is closely related to the BLEU score.
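
As a quick sanity check of that formula, the toy sketch below computes ROUGE-2 exactly as described: clipped bigram counts summed over all references, divided by the total number of reference bigrams.

from collections import Counter

def bigram_counts(text):
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate, references):
    cand = bigram_counts(candidate)
    overlap, total = 0, 0
    for ref in references:
        ref_counts = bigram_counts(ref)
        # a candidate bigram is only credited as often as the reference contains it
        overlap += sum(min(cand[bg], n) for bg, n in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0

print(rouge_2("the cat was on the mat", ["the cat sat on the mat"]))  # 3 of 5 reference bigrams -> 0.6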
61 |
62 | Feature-Based Summarization: Developed by H. P. Luhn at IBM in 1958. The paper proposed that the importance of a sentence is a function of the high-frequency words in the document. Elaborately speaking, the algorithm measures the frequency of words and phrases in the document and decides the importance of a sentence considering the words in the sentence and their frequencies. It states that a sentence is important if it contains high-frequency words, where common words like “a” and “the” are not counted.
63 |
64 | Extractive Summarizations: “Extractive summarization techniques produce summaries by choosing a subset of the sentences in the original text”.-Source
65 |
66 | The Extractive Summarizers first create an intermediate representation that has the main task of highlighting or taking out the most important information of the text to be summarized based on the representations. There are two main types of representations:
67 |
68 | Topic representations: It focuses on representing the topics represented in the texts. There are several kinds of approaches to get this representation. We here are going to talk about two of them. Others include Latent Semantic Analysis and Bayesian Models. If you want to study others as well I will encourage you to go through the references.
69 | Frequency Driven Approaches: In this approach, we assign weights to the words. If the word is related to the topic we assign 1 or else 0. The weights may be continuous depending on the implementation. Two common techniques for topic representations are:
70 | Word Probability: It simply uses the frequency of words as an indicator of the importance of the word. The probability of a word w is given by the frequency of occurrences of the word, f(w), divided by the total number of words in the input, N: P(w) = f(w) / N.
72 |
73 | For sentence importance using the word probabilities, the importance of a sentence is given by the average importance of the words in the sentence.
74 |
75 | TF-IDF (Term Frequency–Inverse Document Frequency): This method is devised as an advancement over the word probability method. Here the TF-IDF method is used for assigning the weights. TF-IDF assigns low weights to words that occur very frequently in most of the documents, under the intuition that they are stopwords or words like “the”. Otherwise, due to the term frequency, if a word appears in a document uniquely or with a high frequency, it is given a high weight.
76 | Topic word Approaches: This approach is similar to Luhn’s approach. “The topic word technique is one of the common topic representation approaches which aims to identify words that describe the topic of the input document”.-Source This method calculates the word frequencies and uses a frequency threshold to find the words that can potentially describe a topic. It classifies the importance of a sentence as a function of the number of topic words it contains.
77 | Indicator Representations: This type of representation depends on the features of the sentences and ranks them on the basis of those features. So, here the importance of a sentence is not dependent on the words it contains, as we have seen in the topic representations, but directly on the sentence features. There are two methods for this type of representation. Let’s look at them.
78 | Graph-Based Methods: This is based on the Page Rank algorithm. It represents text documents as connected graphs. The sentences are represented as the nodes of the graphs and edges between the nodes or the sentences are a measure of similarity between the two sentences. We will talk about this in detail in the upcoming portions.
79 | Machine-Learning Methods: The machine learning methods approach the summarization problem as a classification problem. The models try to classify sentences based on their features into, summary or non-summary sentences. For training the models, we have a training set of documents and their corresponding human reference extractive summaries. Normally Naive Bayes, Decision Tree, and SVMs are used here.
80 | Scoring and Sentences Selection
81 |
82 | Now, once we get the intermediate representations, we move to assign some scores to each sentence to specify their importance. For topic representations, a score to a sentence depends on the topic words it contains, and for an indicator representation, the score depends on the features of the sentences. Finally, the sentences having top scores, are picked and used to generate a summary.
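
The sketch below illustrates this scoring-and-selection step for the simplest topic representation, word probability: each sentence is scored by the average probability P(w) = f(w)/N of its non-stopword words, and the top-scoring sentences form the summary. The tiny stopword list is only for illustration.

import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "are", "for", "on"}

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    prob = {w: count / len(words) for w, count in Counter(words).items()}  # P(w) = f(w) / N

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w in prob]
        return sum(prob[w] for w in tokens) / len(tokens) if tokens else 0.0

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)  # keep original sentence order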
83 |
84 | Graph-Based methods
85 |
86 | The graph-based methods were first introduced by a paper by Rada Mihalcea and Paul Tarau, University of North Texas. The method is called the Text Rank algorithm and is influenced by Google’s Page Rank Algorithm. This algorithm primarily tries to find the importance of a vertex in a given graph.
87 |
88 | Now, how does the algorithm work?
89 |
90 | As we have learned, in this algorithm each sentence is represented as a vertex. An edge joining two vertices or two sentences denotes that the two sentences are similar. If the similarity of any two sentences is greater than a particular threshold, the nodes representing the sentences are joined by an edge.
91 |
92 | When two vertices are joined, it portrays that one vertex is casting a vote for the other one. The more votes a particular node (vertex or sentence) receives, the more important that node, and hence the sentence it represents, becomes. Now, the votes are also weighted: each vote does not carry the same weight or importance. The importance of a vote depends on the importance of the node or sentence casting it; the higher the importance of the node casting the vote, the higher the importance of the vote. So, the number of votes cast for a sentence and the importance of those votes determine the importance of the sentence. This is the same idea behind the Google PageRank algorithm and how it decides and ranks webpages, except that there the nodes represent the webpages.
93 |
94 | If we have a paragraph, we will decompose it into a set of sentences. Now, say we represent each sentence as a vertex V_i, so we obtain a set of vertices V. As discussed, an edge joins a vertex with another vertex of the same set, so an edge E can be represented as a subset of (V x V). In the case of a directed graph, say In(V_i) is the set of incoming edges to a node, Out(V_j) is the set of outgoing edges from a given node, and the importance score of a vertex is given by S(V_j).
95 |
96 | Page Rank Algorithm
97 |
98 | According to the Google PageRank algorithm,
99 |
100 | S(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|
101 |
102 | where S(V_i) is the score of the subject node under consideration, and the sum runs over all nodes V_j that have outgoing edges to V_i. The score of V_j is divided by the out-degree of V_j, which accounts for the probability that the user will choose that particular webpage.
103 |
104 | Elaborately, if this is the graph, then standing at A, as a user, I can go to both B and C, so the chance of me going to C is ½, i.e., 1/(outdegree of A). The factor d is called the damping factor. In the original PageRank algorithm, the factor d incorporates randomness: 1 - d denotes the probability that the user will move to a random webpage instead of one of the connected ones. The factor is generally set to 0.85. The same algorithm is implemented in the text rank algorithm.
105 |
106 | Now, the question arises, how we obtain the scores?
107 |
108 | Let’s check for the page rank algorithm first, then transform it for text rank. As we can see above there are 4 vertices, first, we assign random scores to all the vertices, say, [0.8,0.9,0.9,0.9]. Then, probability scores are assigned to the edges.
109 |
110 | The matrix is the adjacency matrix of the graph. It can be observed that the values of the adjacency matrix are the probability values, i.e., 1/outdegree of that node or vertex. So, effectively the PageRank graph is treated as unweighted, as the equation only contains this 1/outdegree term as the weight.
111 |
112 | Now, the whole equation becomes,
113 | S_new = (1 - d) + d * (Mᵀ · S_old), with M the adjacency (transition) matrix described above.
114 | We can see that the old score matrix is multiplied by the adjacency matrix to get the new score matrix. We will continue this until the L2 norm of the difference between the new score matrix and the old score matrix becomes less than a given constant, usually 1 x 10^-8. This is a convergence property based on linear algebra and the theory of eigenvalues and eigenvectors. We will skip the maths to keep it simple. Once the convergence is achieved, we obtain the final importance scores from the score matrix.
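
A small NumPy sketch of this iteration, written directly from the equations above (the four-node example graph is made up for illustration):

import numpy as np

def pagerank(adjacency, d=0.85, tol=1e-8):
    """adjacency[i, j] = 1 if node i has an edge pointing to node j."""
    out_degree = adjacency.sum(axis=1, keepdims=True)
    out_degree[out_degree == 0] = 1              # avoid division by zero for dangling nodes
    transition = adjacency / out_degree          # each edge carries weight 1 / outdegree
    scores = np.full(adjacency.shape[0], 0.85)   # arbitrary initial scores
    while True:
        new_scores = (1 - d) + d * transition.T @ scores   # vote of V_j scaled by 1/|Out(V_j)|
        if np.linalg.norm(new_scores - scores) < tol:      # L2-norm convergence check
            return new_scores
        scores = new_scores

# Example: A -> B, A -> C, B -> C, C -> A, D -> C
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(A))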
115 |
116 | For the text rank algorithm, the equation and the graph are modified to a weighted graph, because here, just dividing with the out-degree won’t convey the full importance. As a result, the equation becomes:
117 |
118 | WS(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] * WS(V_j)
119 |
120 | W represents the weight factor.
121 |
122 | The implementation of text rank consists of two different natural language processes:
123 |
124 | A keyword extraction task, which selects keywords and phrases
125 | A sentence extraction task, which identifies the most important sentences.
126 |
127 | Keyword extraction task
128 |
129 | Previously this was done using the frequency factor, which gave poor results comparatively. The text rank paper introduced a fully unsupervised algorithm. According to the algorithm, the natural language text is tokenized and parts of speech are tagged, and single words are added to the word graph as nodes. Now, if two words are similar, the corresponding nodes are connected using an edge. The similarity is measured using the co-occurrences of words. If two words occur in a window of N words, N varying from 2 to 10, the two words are considered similar. The words with the maximum number of important incident edges are selected as the most important keywords.
130 |
132 |
133 | Sentence extraction task
134 |
135 | It works similarly to keyword extraction; the only difference is that in keyword extraction the nodes represented keywords, whereas here they represent entire sentences. Now, for the formation of the graph for sentence ranking, the algorithm creates a vertex for each sentence in the text and adds it to the graph. The sentences are too long, so co-occurrence measures cannot be applied. Instead, the paper uses a “similarity” between two sentences based on the content overlap between them; in simpler words, the similarity depends on the number of common word tokens present in the two sentences. The authors propose a very interesting “recommendation” insight here. They describe the edge joining two similar sentences or vertices as if it is recommending the reader to read another line, which is similar to the current line he/she is reading. The similarity therefore denotes the similar content or interest between the two sentences. To prevent long sentences from getting recommended too often, the importance is multiplied by a normalizing factor.
136 |
138 |
139 | The similarity between two sentences is given by:
140 |
141 | Similarity(S_i, S_j) = |{ w_k : w_k ∈ S_i and w_k ∈ S_j }| / ( log(|S_i|) + log(|S_j|) )
142 |
143 | where, given two sentences S_i and S_j, a sentence is represented by the set of N_i words that appear in it; the numerator counts the words common to both sentences and the denominator normalizes by the lengths of the two sentences.
146 |
147 | The most important sentences are obtained in the same way we did for Keyword extraction.
148 |
149 | This is an overall view of how text rank operates; please go through the original paper to explore more.
150 |
151 | In practice, for summary extraction, we use cosine similarity, to decide the similarity between two sentences. Using this method, we may obtain several connected subgraphs that denote the number of important topics in the whole document. The connected component of the subgraphs gives the sentences important for the corresponding topics.
152 |
153 | The “Pytextrank” library allows applying the text rank algorithm directly in Python.
154 |
155 | import spacy
156 | import pytextrank
157 |
158 | # example text
159 | text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
160 |
161 | # load a spaCy model, depending on language, scale, etc.
162 | nlp = spacy.load("en_core_web_sm")
163 |
164 | # add PyTextRank to the spaCy pipeline
165 | tr = pytextrank.TextRank()
166 | nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
167 |
168 | doc = nlp(text)
169 |
170 | # examine the top-ranked phrases in the document
171 | for p in doc._.phrases:
172 |     print("{:.4f} {:5d} {}".format(p.rank, p.count, p.text))
173 |     print(p.chunks)
174 |
175 | Implementation of the Pytextrank library.
176 |
177 | For application details, please refer to the GitHub link.
178 |
179 | Conclusion
180 |
181 | In this article, we have seen basic extractive summarization approaches and the details of the Textrank algorithm. For abstractive methods, feel free to go through Part 2 of the article.
182 |
--------------------------------------------------------------------------------
/LLMonYourDomain.txt:
--------------------------------------------------------------------------------
1 | Title: Leveraging LLMs on your domain-specific knowledge base | by Michiel De Koninck | May, 2023 | ML6team
2 | -----------------------------------------------------------------------------------------------------------
3 |
4 | Leveraging LLMs on your domain-specific knowledge base
5 | With RAG to Riches: wielding the power of LLMs using Retrieval-Augmented Generation to talk to your data
6 |
7 | Michiel De Koninck, published in ML6team · 10 min read · May 8
8 |
28 | Ask ChatGPT a question about the origin of the word “marathon” and it will accurately tell you how Herodotus described the legendary 42km run that Pheidippides completed from Marathon to Athens before collapsing from exhaustion.
29 |
30 | But what about my grandmother’s list of recipes? Sure, I can digitise those recipes, no problem. But what if I want to have advice on what meal to prepare based on the ingredients in my fridge, my favourite color and my mood for the day?
31 |
32 | Let’s see if that’s possible without collapsing from exhaustion.
33 |
34 | LLMs, meet your limits … and exceed them
35 |
36 | An LLM is a Large Language Model. OpenAI’s GPT-4 is one example, Meta’s LLaMA is another. We make the conscious choice here to stick to the general LLM term to refer to these models. Bear in mind: each of these models was trained on a gigantic set of (publicly available) data.
37 |
38 | It has been clearly demonstrated by now that these LLMs have a meaningful understanding of general language and that they are able to (re)produce information relevant to the information that was present in their training data. This is why generative tools like ChatGPT perform astonishingly well at answering questions about topics that the LLM encountered during its training.
39 |
40 | But what remains out of the direct grasp of those massive LLMs is the data that is so valuable within each organisation: the internal knowledge base. The question that thus massively pops up is:
41 |
42 | How can we leverage the power of these LLMs in unlocking information stored in a specific knowledge base upon which it wasn’t originally trained?
43 |
44 | Oh okay, so to do this, can’t we just introduce our internal knowledge base as extra data upon which the LLM should be trained? Or, if you will, can’t we fine-tune the LLM on our specific knowledge base?
45 |
46 | Yes, you most likely can. But for reliable question answering, it might not be the way to go.
47 |
48 | Why fine-tuning won’t always cut it
49 |
50 | Meet Billy the Bookworm. Billy is a large language model and he has devoured a gigantic amount of online information, empowering him with enormous knowledge. Billy however, smart as he is, has not read through the books in your very specific home library.
51 |
52 | Fine-tuning is this: presenting Billy the Bookworm with all the books in your very specific knowledge base and letting him gobble up all that tasty extra information. This way, the LLM bookworm Billy doesn’t just know all of that general information, he also “knows” a lot about the contents of your specific knowledge base.
53 |
54 | Classical approach of fine-tuning on domain specific data (all icons from flaticon)
55 |
56 | Congratulations, through this fine-tuning process you’ve turned Billy into a very specific Billy that knows so much about your specific domain! Below we show how you could start putting Billy to work. By posing questions to your improved bookworm, you can expect answers that use both the information from its gigantic general training set and the information stored in your specific knowledge base.
57 |
58 | Leveraging the fine-tuned LLM to ask questions about your internal knowledge base.
59 |
60 | While certainly powerful, the crucial problem with this solution approach is that you still have little insight into how your bookworm came up with its answers. Moreover, fine-tuning an LLM has its (costly) consequences.
61 |
62 | We list the main reasons why fine-tuning Billy comes up short:
63 |
64 | No source clarity. It’s difficult to prevent hallucination and your LLM has no clear distinction between “general” and “specific” knowledge.
65 | No access restriction. Imagine a case where some users should be able to query the information of strategic documents while others shouldn’t. How would you tackle this? Your fine-tuned Billy just knows everything, he can’t choose to leave out knowledge at inference time.
66 | Hosting an LLM is costly. Once you have a fine-tuned LLM, you have to keep it spinning. A large language model is well… large. Costs to keep it up and running will rack up. Do the benefits outweigh those costs?
67 | Fine-tuning repetitions. Model retraining is required when you want the model to reflect changes to the knowledge base.
68 |
69 | Luckily, all these problems are solvable. If answering questions in a verifiable way and preventing hallucination is what you are looking for: you may not need the hyper-modern bookworm, let’s just ask the good old librarian where to find the answers to your questions.
70 |
71 | With RAG to Riches
72 |
73 | The idea behind Retrieval-Augmented Generation (RAG) is quite straight-forward. Remember, the goal is to unlock the information in our knowledge base. Instead of unleashing (i.e. fine-tuning) our bookworm on it, we comprehensively index the information of our knowledge base.
74 |
75 | By indexing the embeddings of your internal knowledge base, you unlock smart search capabilities.
76 |
77 | In the schema above, we illustrate how the Smart Retriever functions like a librarian. Ideally, the librarian has perfect knowledge of what is in his library. For a visitor asking a certain question, he would know just which chapter of which book to recommend.
78 |
79 | On a more technical level, this describes a semantic search engine. In this case, the embeddings are vectorial representations of document sections and they allow a mathematical description of the actual meaning stored in each section. By comparing embeddings, we can determine which text sections are similar in meaning to which other text sections. This is crucial for the retrieval process displayed below.
80 |
81 | Through leveraging our Smart Retriever, we can force our generator to stick to the content of our knowledge base that is most relevant for answering the question. Et voilà: Retrieval-Augmented Generation.
82 |
83 | In play are two crucial components:
84 |
85 | The Smart Retriever (i.e. the librarian)
86 | The Generator (i.e. the bookworm)
87 |
88 | It should be clear by now why this approach is called Retrieval-Augmented Generation. Based on the question asked, you first retrieve the most relevant information from your internal knowledge base; you then augment the typical generation phase by passing that relevant information explicitly to the generator component.
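
To make this retrieve-then-augment flow tangible, here is a minimal sketch. The sentence-transformer checkpoint, the toy knowledge base and the use of OpenAI's chat API are assumptions for illustration, not the exact setup of our solutions.

import openai
from sentence_transformers import SentenceTransformer, util

openai.api_key = "YOUR_API_KEY"
retriever = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

knowledge_base = [
    "Grandma's lasagne: layer pasta sheets with ragù and béchamel, bake for 45 minutes.",
    "Risotto ai funghi: slowly cook arborio rice with porcini mushrooms and stock.",
]
kb_embeddings = retriever.encode(knowledge_base, convert_to_tensor=True)

def answer(question, top_k=2):
    # Retrieve: find the knowledge-base sections most relevant to the question
    question_embedding = retriever.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, kb_embeddings, top_k=top_k)[0]
    context = "\n".join(knowledge_base[hit["corpus_id"]] for hit in hits)

    # Augment + Generate: pass the retrieved sections explicitly to the generator
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say that you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]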
89 |
90 | Key highlights of this RAG-based setup
91 | Clear indication of the source upon which the answer was based. Allowing for validation of the answer returned by the generator.
92 | Very unlikely to hallucinate: by restricting our generator component to the corpus of our knowledge base, it will admit that it can’t formulate a response when no relevant sources were found by the retriever.
93 | Maintainable Search Index. A knowledge base is a living thing, when it changes, we can adapt our Search Index to reflect those changes.
94 |
95 | Aside from those highlights, the multi-lingual aspect of LLMs is a thing of beauty. You can have a knowledge base consisting of purely Italian recipes which your pasta-loving French friend can talk to in an all-French dialogue.
96 |
97 | Fine-tuning revisited
98 |
99 | Note that in the section above, we dismissed fine-tuning as a valuable option because we had little control on source clarity thus increasing the risk for hallucination.
100 |
101 | It must be noted that the RAG approach, powered by a general LLM, only works well as long as the specific knowledge base does not contain super specific jargon that the LLM can’t understand from its general training.
102 |
103 | Imagine you need the responses of your solution to follow ‘the tone and lingo’ that is present in your knowledge base. In this case the fine-tuning of your LLM seems less avoidable.
104 |
105 | It could be a valid approach to be able to handle specific jargon and then incorporate your fine-tuned LLM in the RAG architecture to reap the combined benefits. Instead of working with a general bookworm, you would then use your specifically trained Billy to power the Generator and/or the Smart Retriever components.
106 |
107 | Why now? What’s new?
108 |
109 | Excellent question.
110 | Semantic Search (smart retrieval) has been around for quite some time and so has generative AI (some primitive forms have been around for decades).
111 | However, we have seen pivotal advancements over the last months.
112 |
113 | On a technological level, we’ve recently witnessed big leaps forward in LLM performance. These positively impact the RAG solution on two levels:
114 |
115 | Embeddings (e.g. Embedding API by OpenAI or Google’s PaLM)
116 | Generative capabilities (e.g. OpenAI’s ChatGPT solution)
117 |
118 | Accompanying that improved generative quality is the increase in traction. Previously, companies could not easily imagine the opportunities of a system relying on generative AI. Now however, thanks to the wide media coverage and adoption of tools like ChatGPT, overall interest has grown exponentially.
119 |
120 | So, though arguably mediocre versions of RAG might have been possible for quite some time, technological improvements and increased traction result in a fruitful market opportunity.
121 |
122 | Challenges on your way to success
123 |
124 | In this section, we aim to introduce you to some of the main challenges with setting up a successful RAG solution.
125 |
126 | Strong dependency on the performance of the Smart Retriever.
127 | The quality of the responses given by your Generative Component will depend directly on the relevancy of the information handed to it by the Smart Retriever. As mentioned above, we can thank LLM advancements for giving us rich and powerful text embeddings. But fetching these embeddings purely via APIs may not be your best option. You should be very conscious when designing your Semantic Search component; perhaps your knowledge base has specific jargon and you might need a custom-fitted (i.e. fine-tuned) component to handle it. A more in-depth practical guide on Semantic Search can be found in this blogpost [1].
128 | Trade-off to be made in restriction to stick to info in knowledge base.
129 | As explained in the RAG architecture, we can force our LLM generative component to restrict itself to the information found in the relevant documents. While this ensures that hallucination (i.e. non-sensical answers) has little chance, it also means you are barely leveraging the information your LLM possesses. You might want your solution to use that knowledge as well but maybe only when requested by the user.
130 | Conversational design to allow complex dialogue.
131 | While our depictions above have represented the user behaviour as asking merely a “one-shot question”, often your user might want to zoom in on the answer provided by your solution (in a ChatGPT-style conversation). Luckily, tools exist to aid you in this battle. The langchain framework offers a helping hand in getting this just right.
132 | Prompt engineering as a way to steer generation toward success.
133 | To get the answer of your generative component just right, you need to tell it exactly what kind of output you expect. Overall, this is far from rocket science. But getting your prompt setup just right for your use case takes time and deserves enough attention. It may be worthwhile looking at prompt management systems to make sure you can keep track of which prompting works best for which situations.
134 | Choosing the right LLM: what does it cost and where does my data go?
135 | Throughout this text, we haven’t made any explicit choice regarding what LLM(s) to use in your solution. When choosing which LLM (API) to use, make sure to take privacy and cost restrictions into consideration. There are quite some decent options out there already. We have OpenAI’s GPT, Meta’s LLaMA, Google’s PaLM and with Elon Musk claiming to join the LLM scene, who knows where things will go. The exciting news is: more options will come and competition should drive LLM performance up and prices down.
136 | Getting and keeping your LLM solution in production (LLMOps).
137 | As with all mature AI solutions: building them is one thing, getting/keeping them in production is another. The field of LLMOps focusses on the operationalisation of LLMs. Monitoring the performance of your LLM-based solution, keeping your knowledge base and search index up-to-date, processing conversational history…
138 | Before flinging your LLM solution into production, think wisely about how to maintain it and how to keep it fruitful in the long run.
139 |
140 | Charmed by the potential of RAG and intrigued by the related challenges, we now move on to looking at an actual RAG-based solution.
141 |
142 | Getting your hands dirty with RAG
143 |
144 | If your interest is sparked by the concept of Retrieval-Augmented Generation, you may be asking yourself:
145 |
146 | Do I have what it takes to take a RAG-based solution for a spin?
147 |
148 | Well, if you have:
149 |
150 | specific knowledge: a moderate (preferably organised) database of “knowledge articles” that contain useful information that is not easily found on the world-wide web (e.g. technical documents, onboarding guidelines, handled support tickets…)
151 | business value: a clear definition of business value if that information could be unlocked for the intended users
152 |
153 | Then yes, RAG might be the way to go for you.
154 |
155 | As an experiment, we recently built a small demo to showcase how this technology can be leveraged to support government staff in answering parliamentary questions more easily.
156 | In this case, the specific knowledge consists of:
157 |
158 | a set of Flemish legislative documents
159 | a set of parliamentary questions of the past
160 |
161 | The business value, meanwhile is in:
162 |
163 | improving efficiency by automatically suggesting answers to parliamentary questions based on the Flemish knowledge base
164 | improving transparency and user adoption through explicit citations
165 | Screenshot of demo solution built around the case of “application to help answer parliamentary questions”
166 |
167 | If you are looking for some guidelines on how to technically implement a similar solution, stay tuned for a follow-up blogpost where we aim to zoom in on the technical details of setting up RAG.
168 |
169 | References
170 | [1]: Mathias Leys. (May 9, 2022). Semantic search: a practical overview. https://blog.ml6.eu/semantic-search-a-practical-overview-bf2515e7be76
--------------------------------------------------------------------------------
/LaMini-LM - Mini Models Maxi Data! [English] [DownloadYoutubeSubtitles.com].txt:
--------------------------------------------------------------------------------
1 | Okay.
2 |
3 | In this video, I'm going to look at
4 | LaMini-LM, a diverse herd of distilled
5 |
6 | models from large-scale instructions.
7 |
8 | so this is based on a paper.
9 |
10 | but they've also released a code for this
11 | and they've released a data set for this.
12 |
13 | I will preface this video at
14 | the start that if you're just
15 |
16 | here to find out if this is the
17 | latest best large language model.
18 |
19 | This is not the latest,
20 | best large language model.
21 |
22 | In fact, all these models
23 | are very tiny by design.
24 |
25 | And so if you're just here for that,
26 | this is probably not the video for you.
27 |
28 | Okay.
29 |
30 | but this is a super interesting paper
31 | and a super interesting project.
32 |
33 | So I definitely want to sort of
34 | cover it and have a look at some
35 |
36 | of the things that they raise.
37 |
38 | And then also we'll have a look
39 | at the models and we'll have a
40 |
41 | look at you know, some of the
42 | code and stuff behind it as well.
43 |
44 | So the key idea here is that what
45 | they're basically trying out is
46 |
47 | that okay, what happens if you go
48 | for a really small model but you
49 |
50 | use a lot more scaled instructions.
51 |
52 | So if you think back to alpaca, alpaca
53 | only had like 52,000, instructions.
54 |
55 | The base model LLaMa
56 | was a very large model.
57 |
58 | That was a 7 billion parameter model.
59 |
60 | but the number of instructions from
61 | Alpaca itself, were actually very small.
62 |
63 | So the idea here is to do the inverse
64 | of that, of go for a small model, but
65 |
66 | with a lot of distilled instructions.
67 |
68 | So, okay what are they actually doing?
69 |
70 | here, if we look at this diagram, can
71 | sort of see that they're basically,
72 |
73 | they've got some seed instructions.
74 |
75 | They then use, GPT- 3.5 turbo API
76 | to generate synthetic instructions.
77 |
78 | and add those and then they generate
79 | synthetic responses to this.
80 |
81 | And their dataset gets really big.
82 |
83 | So they've released the dataset here.
84 |
85 | And they've gone right up to
86 | sort of 2.58 million pairs of
87 |
88 | instructions and responses for this.
89 |
90 | Now they've used, the alpaca dataset.
91 |
92 | They've used the self instruct
93 | dataset they've used P3.
94 |
95 | they've used the original flan
96 | dataset as all sort of ways
97 |
98 | to generate more instructions.
99 |
100 | and if we have a look, we can see
101 | that, you know, they've actually
102 |
103 | released the dataset on hugging face.
104 |
105 | so if you want it to actually come
106 | down and train your own big model.
107 |
108 | this is something you certainly can do.
109 |
110 | So here they point out, they give
111 | some a little bit, info about it.
112 |
113 | Then they show that they've actually
114 | trained a lot of models here.
115 |
116 | So we can see that they've trained quite
117 | a few Flan T5, the Cerebras models,
118 |
119 | which I did a video about, even things
120 | like GPT-2 which is kind of insane
121 |
122 | because that model is like super old now.
123 |
124 | they've gone back and trained
125 | up, you know, versions of that,
126 |
127 | that use these instructions.
128 |
129 | And this is one of the models that
130 | we're going to look at, in here.
131 |
132 | and they've also got
133 | other models on the way.
134 |
135 | So the GPT-J and the LLaMa
136 | models, I guess, are training.
137 |
138 | those are definitely gonna take, you
139 | know, for the LLaMa models are gonna
140 |
141 | take quite a while to train with this 2.5
142 | million, dataset that they've got here.
143 |
144 | So let's just jump into the paper
145 | and have a look at what's going on
146 |
147 | here and see what they're doing.
148 |
149 | So they've got the same diagram as
150 | what they've got in GitHub here.
151 |
152 | But, here we obviously go
153 | into a lot more details.
154 |
155 | So there's sort of proposing this idea
156 |
157 | of, if you went for a smaller
158 | model, but a much larger dataset
159 |
160 | for the fine tuning of this.
161 |
162 | And so they look at, what sort of out
163 | there, they talk a bit about, the alpacas,
164 |
165 | all the models that have come after that
166 | with in the seven to 13 billion range.
167 |
168 | And they propose that they're
169 | going to go for, you know, models
170 |
171 | that are much smaller than that.
172 |
173 | So here, they basically talk about
174 | that they're going to go for a dataset
175 |
176 | size of 2.58 million instructions.
177 |
178 | And this is a large part
179 | of what the paper is about.
180 |
181 | The fine tuning of models
182 | and stuff is interesting.
183 |
184 | And their evaluation is also
185 | interesting, but really it's about
186 |
187 | how did they make this dataset?
188 |
189 | because I think there were a lot of
190 | lessons that we can take away from this
191 |
192 | ourselves to make datasets for ourselves.
193 |
194 | They tell us that this was
195 | made using the GPT 3.5 turbo.
196 |
197 | and the idea is this is going to
198 | generate supplementary instructions.
199 |
200 | And they talk about, there are two
201 | types of instructions that they use, the
202 |
203 | example guided instruction generation,
204 | and topic guided instruction generation.
205 |
206 | So the cool thing in here is they're got
207 | quite good descriptions of what they do.
208 |
209 | They even give you the prompts for this.
210 |
211 | And I threw these into ChatGPT.
212 |
213 | And we can have a look at
214 | actually, how does this go?
215 |
216 | So you can see here that
217 | they provide three examples.
218 |
219 | and they put example tags around
220 | these and then they basically
221 |
222 | say generate 20 diverse examples.
223 |
224 | So this is the first way of generation.
225 |
226 | this is basically just generating more
227 | examples based on giving it some examples.
228 |
229 | and here they just saying generate
230 | 20 diverse examples that are
231 |
232 | similar to the provided examples.
233 |
234 | You do not need to provide a
235 | response to the generated examples.
236 |
237 | Each example must include an instruction.
238 |
239 | Each generated instruction can either
240 | be an imperative sentence or a question.
241 |
242 | And then basically tells us, you know, how
243 | it's going to export it with these tags.
244 |
245 | And sure enough, you can see
246 | if we dropped that into ChatGPT
247 |
248 | it generates out, examples
249 | quite well with this.
250 |
251 | So that's the first
252 | generation that they do.
253 |
254 | This was the example guided instruction
255 | generation that they've got in there.
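
(For anyone who wants to try this kind of example-guided generation themselves, a rough sketch with the 2023-era OpenAI Python client could look like the block below. The seed instructions and the prompt wording are a paraphrase of what the paper describes, not the authors' verbatim prompt.)

import openai

openai.api_key = "YOUR_API_KEY"

# A few seed instructions wrapped in example tags, mirroring the structure described above
seed_examples = [
    "Explain the difference between a virus and a bacterium.",
    "Write a short poem about autumn.",
    "What are the main causes of inflation?",
]

prompt = "\n".join(f"<example>{ex}</example>" for ex in seed_examples)
prompt += (
    "\n\nGenerate 20 diverse examples that are similar to the provided examples. "
    "You do not need to provide a response to the generated examples. "
    "Each example must include an instruction, and each generated instruction can be "
    "either an imperative sentence or a question. Wrap every generated example in <example> tags."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
)
print(response["choices"][0]["message"]["content"])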
256 |
257 | The second type of generation that
258 | they've got in here is the topic
259 |
260 | guided instruction generation.
261 |
262 | So here they've got a really nice idea
263 | what they do is they go to Wikipedia.
264 |
265 | And they collect, topics from Wikipedia.
266 |
267 | So, first off Wikipedia has 2.2
268 | million topics but what they do
269 |
270 | is they filter these based on a
271 | number of different requirements.
272 |
273 | so first off the topic must
274 | have less than three words.
275 |
276 | they make the point that, if it has
277 | less than three words, it's generally
278 |
279 | going to be a more sort of general
280 | topic that we'll have sub topics.
281 |
282 | And that's one of the things
283 | that they want is they want it to
284 |
285 | have at least, more than 10 sub
286 | categories or sub topics in there.
287 |
288 | And 50 pages of information.
289 |
290 | So they know that this is really
291 | not just a niche thing this
292 |
293 | is a proper category in this.
294 |
295 | Once they go and do that filtering,
296 | you can see that they end up with
297 |
298 | three and a half thousand categories
299 | that serve as common topics.
300 |
301 | They then basically use a
302 | prompt to generate examples
303 |
304 | based on these kind of topics.
305 |
306 | So if we go back and look at the next
307 | example in, ChatGPT, we can see that this
308 |
309 | is using some examples like before, but
310 | now It's now giving it actual topics and
311 |
312 | here we can see the topics are you know,
313 | these design bureaus, we've got infantry.
314 |
315 | so this is the idea here is
316 | that, it will now generate,
317 |
318 | Examples based on these particular topics.
319 |
320 | and you can see sure enough it does,
321 | That we can see straight away, what
322 |
323 | are the key factors to consider when
324 | choosing a design bureau for your project?
325 |
326 | How has a design bureau
327 | evolved over the past decade?
328 |
329 | We see like a whole bunch
330 | of different things.
331 |
332 | What are some of the most interesting
333 | characteristics of infantry soldiers?
334 |
335 | So this is getting examples
336 | that are on topic for this.
337 |
338 | They then basically use the GPT-3.5
339 | turbo API to generate responses.
340 |
341 | they then cluster those to see, okay,
342 | how do these compare in diversity
343 |
344 | compared to the original data sets?
345 |
346 | And you can see here, they've basically
347 | got some diagrams of them doing the
348 |
349 | clustering after that, they go on
350 | to basically, train up some models.
351 |
352 | And, run them on a set
353 | of downstream tasks.
354 |
355 | And so they've got some really
356 | interesting ideas here that they
357 |
358 | comparing not only, Similar kinds
359 | of decoder models or GPT models.
360 |
361 | They also compare it to the flan models,
362 | the T5 models, the encoder-decoder models.
363 |
364 | and so they've got some
365 | interesting points here.
366 |
367 | About what they found, you know, which
368 | kinds of models outperform other models.
369 |
370 | So they talk about how the encoder-decoder
371 | language models outperform the decoder-only
372 |
373 | models when things are
374 | roughly the same parameter size in there.
375 |
376 | they make, you know, a really amazing
377 | statement that this LaMini-Flan-
378 |
379 | T5 248M even outperforms LLaMa 7
380 | billion on downstream NLP tasks.
381 |
382 | So that would be the LLaMa that's
383 | not been fine tuned, obviously.
384 |
385 | so obviously this fine tuning of the
386 | flan is doing better than the raw
387 |
388 | LLaMa, model there.
389 |
390 | they also basically point out,
391 | you know, some things about
392 |
393 | the, GPT, what they calling the
394 | LaMini-GPT, which is the GPT-2 model.
395 |
396 | And, they compare them
397 | to the Cerebras models.
398 |
399 | Turns out that, sure enough, just as
400 | we saw when we looked at the
401 |
402 | Cerebras models in that video,
403 | probably those models weren't that great
404 |
405 | as base models for this kind of thing.
406 |
407 | and that surprisingly GPT-2 seems to
408 | have done better than some of those.
409 |
410 | we can say, you know, generally vanilla
411 | GPT-2 also outperforms Cerebras-GPT models
412 |
413 | of comparable size on downstream tasks.
414 |
415 | So let's jump into the code and have a
416 | look at what these actually have.
417 |
418 | So, like I mentioned earlier on,
419 | they've released the dataset.
420 |
421 | they've also released a set of different
422 | models on the Hugging Face Hub.
423 |
424 | So we've got some T5s.
425 |
426 | Going from tiny little models of
427 | 61 million, right up to the biggest
428 |
429 | model in here, which is only the GPT-2
430 | model at 1.5 billion parameters,
431 |
432 | which is very small compared to
433 | what we've been looking at recently.
434 |
435 | So I basically decided to pick
436 | three of these models to try out,
437 |
438 | I've gone for the Flan-T5,
439 |
440 | the GPT-2 model, and the Neo model. So if
441 | we jump in and have a look at this, the
442 |
443 | first one I've got here is the Neo model.
444 |
445 | And we've just got some
446 | simple code for doing this.
447 |
448 | This is a decoder only model.
449 |
450 | so we've got the causal language
451 | modeling class bringing it in, and we're just
452 |
453 | bringing in a text-generation pipeline.
454 |
455 | and, I've just written some, little
456 | code to sort of just tidy up the
457 |
458 | responses as they come out of the model.
459 |
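Something like the following is a minimal sketch of that setup; the exact Hugging Face model id is an assumption here (the LaMini checkpoints are published under the MBZUAI organisation), and the clean-up helper is just an illustration of the kind of tidying mentioned above.

# Decoder-only LaMini model used through a causal-LM text-generation pipeline.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

checkpoint = "MBZUAI/LaMini-Neo-1.3B"  # assumed model id on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

def ask(instruction, max_length=512):
    # Tidy up the response: drop the echoed prompt and surrounding whitespace.
    full = generator(instruction, max_length=max_length, do_sample=True)[0]["generated_text"]
    return full[len(instruction):].strip()

print(ask("What are the differences between alpacas, vicunas, and llamas?"))

# The GPT-2 variant discussed later uses the same code with a different
# checkpoint, e.g. "MBZUAI/LaMini-GPT-1.5B" (again, an assumed model id).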
460 | And you can see that okay first off
461 | we can ask it the general sort of
462 |
463 | questions that we've been asking
464 | is: what are the differences between
465 |
466 | alpacas, Vicunas, and llamas?
467 |
468 | And you'll see that it's able
469 | to give us some information.
470 |
471 | Now I'm not going to say that
472 | this is, as good as the 7 billion
473 |
474 | models far from it, right?
475 |
476 | This is 1.3 billion parameters
477 | that we're looking at here.
478 |
479 | so the fact that it's even able
480 | to just stay on topic and stuff like
481 |
482 | that is already a win for some of these.
483 |
484 | what is the capital of England?
485 |
486 | London.
487 |
488 | write a short note to Sam Altman.
489 |
490 | So this one actually takes a
491 | pretty decent shot at doing this.
492 |
493 | much better than you'll see from some of
494 | the other models that we look at later on.
495 |
496 | it's definitely getting some
497 | facts and some phrasing off
498 |
499 | though, when we look at this.
500 |
501 | of course, because this has been
502 | trained on a distilled dataset,
503 |
504 | we're going to get the 'as an AI
505 | language model' stuff in here.
506 |
507 | unfortunately for some of these you'll
508 | see that it tends to hallucinate
509 |
510 | about some of these things.
511 |
512 | I've done this one twice.
513 |
514 | First time, it didn't do very well.
515 |
516 | Second time, it did a lot better.
517 |
518 | so it is interesting that I probably
519 | don't see the 'as an AI language model'
520 |
521 | as much as I see it in the Alpaca
522 | or in the Vicuna models, et cetera.
523 |
524 | my guess is that probably the
525 | diversity of the dataset and the size of
526 |
527 | the dataset mean that it doesn't see this
528 | as often as those other models.
529 |
530 | let's jump into the next one and
531 | just have a look at this one.
532 |
533 | So this is the GPT 1.5 billion
534 |
535 | LaMini-LM. So this is
536 | technically the GPT-2 model
537 |
538 | that we've got in here.
539 |
540 | Same code, et cetera for running this.
541 |
542 | we can see that this is giving us
543 | responses again that are not bad.
544 |
545 | This is a little bit bigger than the
546 | Neo one, but pretty sure the Neo is
547 |
548 | trained on more tokens than GPT-2.
549 |
550 | It's able to get some of these right.
551 |
552 | It didn't do a great job at this.
553 |
554 | write a short note to Sam Altman,
555 | giving reasons to open source GPT-4.
556 |
557 | And a number of times when I ran
558 | this, it just said 'noted' or it said,
559 |
560 | 'okay, I will do that', that kind of thing.
561 |
562 | here, it's sort of saying: to open
563 | source GPT-4, I recommend giving it
564 |
565 | a try as it has potential benefits
566 | for your team's development efforts.
567 |
568 | You can see that the responses
569 | in here are pretty good though.
570 |
571 | we would not have expected this
572 | maybe from a GPT, something like a
573 |
574 | DialoGPT or something like that,
575 |
576 | that was based on GPT-2
577 | in the past, for these.
578 |
579 | The final one that I look
580 | at is the Flan-T5 one.
581 |
582 | So this is actually half the
583 | size of the previous one.
584 |
585 | now this is a sequence-to-sequence
586 | model, obviously an
587 |
588 | encoder-decoder model for the T5 models.
589 |
590 | So we need to use the AutoModelForSeq2SeqLM
591 | class in there.
592 |
593 | And we need to change this
594 | to text2text-generation.
595 |
596 | Also, I changed things a little bit.
597 |
598 | They seem to have changed the
599 | generation output from this a little bit,
600 |
601 | so I've changed the
602 | filtering function there.
603 |
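As a sketch of those changes (again with an assumed model id), the sequence-to-sequence version looks something like this:

# Encoder-decoder LaMini Flan-T5 model via a text2text-generation pipeline.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

checkpoint = "MBZUAI/LaMini-Flan-T5-783M"  # assumed model id on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
generator = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

def ask(instruction, max_length=512):
    # Seq2seq pipelines return only the generated answer, so the filtering
    # function only needs to strip whitespace rather than remove the prompt.
    return generator(instruction, max_length=max_length)[0]["generated_text"].strip()

print(ask("What is the capital of England?"))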
604 | okay, we can see what are the differences
605 | between alpacas, Vicunas, and llamas.
606 |
607 | it doesn't give us a great answer,
608 | but it's definitely on topic here.
609 |
610 | so you could imagine that we're using
611 | it as a sort of general model here, which
612 |
613 | is not how you would use these things.
614 |
615 | These things are more for where
616 | you would basically hone in on one
617 |
618 | domain and train it up on a lot of
619 | data for that one particular domain.
620 |
621 | and get it going for that, so
622 | that's something to think about.
623 |
624 | what is the capital of England?
625 |
626 | The capital of England is London.
627 |
628 | write a short note to Sam Altman,
629 | giving reasons to open source GPT-4.
630 |
631 | and then, you know, it's sort of done
632 | something around it, but definitely
633 |
634 | not getting the, you know, in-depth
635 | responses that we get from a Vicuna
636 |
637 | 13 billion or something. But we're
638 | talking about something that's 20
639 |
640 | times smaller, so it's not surprising
641 |
642 | in this case.
643 |
644 | Asking about Homer.
645 |
646 | So first off.
647 |
648 | it basically does the whole, as
649 | an AI, I don't have the ability
650 |
651 | to like, or dislike things.
652 |
653 | However, I'm knowledgeable about Homer.
654 |
655 | But it's quite sassy that it's not going
656 | to tell us what it knows about Homer.
657 |
658 | So ask it again.
659 |
660 | and then here it gets, oh yeah.
661 |
662 | Homer as a character on the TV show The Simpsons
663 | who is known for his intelligence.
664 |
665 | I'm not sure that is correct.
666 |
667 | humor and willingness to help those
668 | in need; again, not sure about that.
669 |
670 | so you can see just the different responses
671 | that it gets out for those things.
672 |
673 | None of them did well on the reasoning
674 | questions. If we sort of look at
675 |
676 | these, I think it's the reasoning
677 | one here where it's basically saying that you
678 |
679 | would have 16 apples left, which is not accurate.
680 |
681 | although it does get the bit
682 | about a haiku, that yes, that
683 |
684 | could fit in a single tweet.
685 |
686 | The Neo model, again, doesn't do a good
687 | job with the reasoning on the apples, but
688 |
689 | it does an okay job with the haiku thing.
690 |
691 | so this one also again not a
692 | great job at counting or working
693 |
694 | out those sorts of things.
695 |
696 | But it does know a little bit
697 | about haikus in this case.
698 |
699 | So anyway, the goal here is not that
700 | these are going to replace your
701 |
702 | general-purpose Vicuna or your
703 | Koala model or something like that.
704 |
705 | The goal here is that these are
706 | showing that if you train with a lot
707 |
708 | more data even a small model can get
709 | decent results with this kind of thing.
710 |
711 | So you've got to think that as they
712 | bring out the 7 billion models, and
713 |
714 | if they do the 13 and 30 billion
715 | models, on this 2.58 million example dataset.
716 |
717 | It's going to be interesting to see how
718 | they compare to the full Vicuna models or
719 |
720 | the StableVicuna, those sorts of models.
721 |
722 | Anyway, as always, if you've got questions,
723 | please put them in the comments. If
724 |
725 | you found this useful, please click
726 | like and subscribe. I will talk to
727 |
728 | you in the next video. Bye for now.
729 |
730 |
731 |
--------------------------------------------------------------------------------
/Optimization/AutomaticTextSummarization.txt:
--------------------------------------------------------------------------------
1 | Title: Automatic Text Summarization with Machine Learning — An overview | by Luís Gonçalves | luisfredgs | Medium
2 | -----------------------------------------------------------------------------------------------------------------
3 |
8 |
9 | Automatic Text Summarization with Machine Learning — An overview
10 |
11 | Luís Gonçalves · Published in luisfredgs · 11 min read · Apr 11, 2020
25 |
29 |
30 | Summarization is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content. Since manual text summarization is a time expensive and generally laborious task, the automatization of the task is gaining increasing popularity and therefore constitutes a strong motivation for academic research.
31 |
32 | There are important applications for text summarization in various NLP related tasks such as text classification, question answering, legal texts summarization, news summarization, and headline generation. Moreover, the generation of summaries can be integrated into these systems as an intermediate stage which helps to reduce the length of the document.
33 |
34 | In the big data era, there has been an explosion in the amount of text data from a variety of sources. This volume of text is an inestimable source of information and knowledge which needs to be effectively summarized to be useful. This increasing availability of documents has demanded exhaustive research in the NLP area for automatic text summarization. Automatic text summarization is the task of producing a concise and fluent summary without any human help while preserving the meaning of the original text document.
35 |
36 | It is very challenging, because when we as humans summarize a piece of text, we usually read it entirely to develop our understanding, and then write a summary highlighting its main points. Since computers lack human knowledge and language capability, it makes automatic text summarization a very difficult and non-trivial task.
37 |
38 | Various models based on machine learning have been proposed for this task. Most of these approaches model this problem as a classification problem which outputs whether to include a sentence in the summary or not. Other approaches have used topic information, Latent Semantic Analysis (LSA), Sequence to Sequence models, Reinforcement Learning and Adversarial processes.
39 |
40 | In general, there are two different approaches for automatic summarization: extraction and abstraction.
41 |
42 | The extractive approach
43 |
44 | Extractive summarization picks up sentences directly from the document based on a scoring function to form a coherent summary. This method works by identifying important sections of the text, cropping them out, and stitching together portions of the content to produce a condensed version.
45 |
48 | Thus, they depend only on the extraction of sentences from the original text. Most of the summarization research today has focused on extractive summarization, since it is easier and yields naturally grammatical summaries requiring relatively little linguistic analysis. Moreover, extractive summaries contain the most important sentences of the input, which can be a single document or multiple documents.
49 |
50 | A typical flow of extractive summarization systems consists of:
51 |
52 | 1. Construct an intermediate representation of the input text with the aim of finding salient content. Typically, this works by computing term frequency (TF) metrics for each sentence in a sentence-term matrix.
53 |
54 | 2. Score the sentences based on that representation, assigning each sentence a value denoting the probability with which it will get picked up in the summary.
55 |
56 | 3. Produce a summary based on the top k most important sentences. Some studies have used Latent Semantic Analysis (LSA) to identify semantically important sentences.
57 |
58 | For a good starting point on LSA models in summarization, check this paper and this one. An implementation of LSA for extractive text summarization in Python is available in this github repo. For example, I used this code to make the following summary:
59 |
60 | Original text:
61 |
62 | According to Diego Howës, a specialist at Certsys (a company that has been working on implementing and modifying the flows of these bots), companies have been looking to extend their internal-facing service bots with these new prevention demands, so that employees can have at hand information about the disease, types of care, good hygiene practices, and general guidance on optimizing home office work. Businesses that want to communicate with the external public, on the other hand, see other needs. "We have retail clients who asked us to create new flows addressing the topic and informing consumers that deliveries of products purchased online may suffer some delay," comments Howës of Certsys, which has been trying to broaden the scope of these channels to suit this moment of attention. Also according to the specialist, across the whole market it is possible to observe a trend towards automating service to the population, with a search for chatbots that work on high-traffic channels, such as WhatsApp, in the case of public agencies. In the health area, disseminating information about the virus pandemic has been an ongoing effort.
63 |
64 | Summarized text:
65 |
66 | According to Diego Howës, a specialist at Certsys (a company that has been working on implementing and modifying the flows of these bots), companies have been looking to extend their internal-facing service bots with these new prevention demands, so that employees can have at hand information about the disease, types of care, good hygiene practices, and general guidance on optimizing home office work. Businesses that want to communicate with the external public, on the other hand, see other needs. In the health area, disseminating information about the virus pandemic has been an ongoing effort.
67 |
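For illustration only (this is not the code from the repository mentioned above), a minimal LSA-style extractive summarizer can be sketched with scikit-learn: build a TF-IDF sentence-term matrix, apply SVD, and keep the sentences with the strongest weight along the leading latent topic.

# Minimal LSA-style extractive summarizer sketch (illustrative only).
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_summarize(text, num_sentences=3):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= num_sentences:
        return " ".join(sentences)
    tfidf = TfidfVectorizer().fit_transform(sentences)          # sentence-term matrix
    topic = TruncatedSVD(n_components=1).fit_transform(tfidf)   # leading latent topic
    scores = np.abs(topic[:, 0])
    top = sorted(np.argsort(scores)[-num_sentences:])           # keep original order
    return " ".join(sentences[i] for i in top)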
68 | Recent studies have applied deep learning in extractive summarization as well. For instance, Sukriti proposes an extractive text summarization approach for factual reports using a deep learning model, exploring various features to improve the set of sentences selected for the summary.
69 |
70 | Yong Zhang proposed a document summarization framework based on convolutional neural networks to learn sentence features and perform sentence ranking jointly, using a CNN model for sentence ranking. The authors adapt the original classification model of Y. Kim to address a regression process for sentence ranking. The neural architecture used in that paper is composed of a single convolution layer built on top of the pre-trained word vectors, followed by a max-pooling layer. The authors carried out experiments on both single- and multi-document summarization tasks to evaluate the proposed model. Results have shown the method achieved competitive or even better performance compared with baselines. The source code used in the experiments can be found here.
71 |
72 | Abstractive summarization
73 |
74 | Abstractive summarization methods aim at producing a summary by interpreting the text using advanced natural language techniques in order to generate a new, shorter text (parts of which may not appear in the original document) that conveys the most critical information from the original text. This requires rephrasing sentences and incorporating information from the full text, much as a human-written abstract usually does. In fact, an acceptable abstractive summary covers core information in the input and is linguistically fluent.
75 |
76 | Thus, they are not restricted to simply selecting and rearranging passages from the original text.
77 |
78 | Abstractive methods take advantage of recent developments in deep learning. Since the task can be regarded as a sequence mapping task, where the source text should be mapped to the target summary, abstractive methods benefit from the recent success of sequence-to-sequence models. These models consist of an encoder and a decoder, where a neural network reads the text, encodes it, and then generates the target text.
79 |
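As a concrete, minimal example of this encoder-decoder setup, a pretrained sequence-to-sequence model can be run for abstractive summarization through the Hugging Face transformers library; the model name below is just one common choice for illustration, not something prescribed by this article.

# Abstractive summarization with a pretrained encoder-decoder model.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")  # illustrative model choice

article = (
    "Automatic text summarization condenses a document into a shorter version "
    "while preserving its key information. Abstractive systems generate new "
    "sentences instead of copying passages from the source."
)
result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])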
80 | In general, building abstract summaries is a challenging task, which is relatively harder than data-driven approaches such as sentence extraction and involves complex language modeling. Thus, they are still far away from reaching human-level quality in summary generation, despite recent progress using neural networks inspired by the progress of neural machine translation and sequence to sequence models.
81 |
82 | An example is the work of Alexander et al., which proposed a neural attention model for abstractive sentence summarization (NAMAS), exploring a fully data-driven approach for generating abstractive summaries using an attention-based encoder-decoder method. The attention mechanism has been broadly used in sequence-to-sequence models, where the decoder extracts information from the encoder based on the attention scores on the source-side information. The code to reproduce the experiments from the NAMAS paper can be found here.
83 |
84 | Example output of the attention-based summarization of Alexander et al. The heatmap represents a soft alignment between the input (right) and the generated summary (top). The columns represent the distribution over the input after generating each word.
85 |
86 | Recent studies have argued that attention-based sequence-to-sequence models for abstractive summarization can suffer from repetition and semantic irrelevance, causing grammatical errors and insufficient reflection of the main idea of the source text. To tackle this problem, Junyang Lin et al. propose to implement a gated unit on top of the encoder outputs at each time step, which is a CNN that convolves all the encoder outputs.
87 |
88 | Based on the convolution and self-attention of Vaswani et al., a convolutional gated unit sets a gate to filter the source annotations from the RNN encoder, in order to select information relevant to the global semantic meaning. In other words, it refines the representation of the source context with a CNN to improve the connection of the word representation with the global context. Their model is capable of reducing repetition compared with the sequence-to-sequence baseline, outperforming the state-of-the-art methods. The source code of the paper can be found here.
89 |
90 | Other methods for abstractive summarization have borrowed concepts from the pointer network of Vinyals et al. to address the undesirable behavior of sequence-to-sequence models. A Pointer Network is a neural attention-based sequence-to-sequence architecture that learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.
91 |
92 | For example, Abigail See et al. presented an architecture called Pointer-Generator, which allows copying words from the input sequence by pointing to specific positions, whereas a generator allows generating words from a fixed vocabulary of 50k words. The architecture can be viewed as a balance between extractive and abstractive approaches.
93 |
94 | In order to overcome the repetition problems, the paper adapts the coverage model of Tu et al., which was proposed to overcome the lack of coverage of source words in neural machine translation models. Specifically, Abigail See et al. defined a flexible coverage loss to penalize repeatedly attending to the same locations, only penalizing the overlap between each attention distribution and the coverage up to the current time step, which helps to prevent repeated attention. The source code for the model can be found here.
95 |
96 | The Pointer-generator model. For each timestep in the decoder, the probability of generating words from the fixed vocabulary, versus copying words from source using a pointer is weighted by a generation probability p_{gen}. The vocabulary distribution and attention distribution are weighted and summed to obtain the final distribution. The attention distribution can be viewed as a probability distribution over the source words, that tells the decoder where to look to generate the next word. It is used to produce a weighted sum of the encoder hidden states, known as the context vector.
97 |
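In equation form (following the notation of See et al.), the final distribution described in the caption above can be written as:

P(w) = p_{gen} \cdot P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i

where a_i is the attention weight on the i-th source token and P_{vocab} is the decoder's distribution over the fixed vocabulary.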
98 | Other studies in abstractive summarization have borrowed concepts from the reinforcement learning (RL) field to improve model accuracy. For example, Chen et al. proposed a hybrid extractive-abstractive architecture using two neural networks in a hierarchical way, which selects salient sentences from the source using an RL-guided extractor and then rewrites them abstractively to generate a summary.
99 |
100 | In other words, the model simulates how humans summarize long documents: it first uses an extractor agent to select salient sentences or highlights, and then employs an abstractor network (an encoder-aligner-decoder model) to rewrite each of these extracted sentences. To train the extractor on available document-summary pairs, the model uses policy-based reinforcement learning (RL) with sentence-level metric rewards to connect the extractor and abstractor networks and to learn sentence saliency.
101 |
102 | Reinforced training of the extractor (for one extraction step) and its interaction with the abstractor.
103 |
104 | The abstractor network is an attention-based encoder-decoder which compresses and paraphrases an extracted document sentence to a concise summary sentence. Moreover, the abstractor has a useful mechanism to help directly copy some out-of-vocabulary (OOV) words.
105 |
106 | The convolutional extractor agent
107 |
108 | The extractor agent is a convolutional sentence encoder that computes representations for each sentence based on the input embedded word vectors. Further, an RNN encoder computes a context-aware representation, and then an RNN decoder selects a sentence at time step t. Once the sentence is selected, the context-aware representation will be fed into the decoder at time t + 1.
109 |
110 | Thus, the method incorporates the abstractive approach's advantages of concisely rewriting sentences and generating novel words from the full vocabulary, while adopting intermediate extractive behavior to improve the overall model's quality, speed, and stability. The authors argued that model training is 4x faster than the previous state-of-the-art. Both the source code and the best pre-trained models were released to promote future research.
111 |
112 | Other recent studies have proposed using a combination of adversarial processes and reinforcement learning for abstractive summarization. An example is Liu et al. (2017), whose work proposes an adversarial framework to jointly train a generative model and a discriminative model, similar to Goodfellow et al. (2014). In that framework, a generative model takes the original text as input and generates the summary, using reinforcement learning to optimize the generator for a highly rewarded summary. Further, a discriminator model tries to distinguish the ground truth summaries from the summaries produced by the generator.
113 |
114 | The discriminator is implemented as a text classifier that learns to classify the generated summaries as machine- or human-generated, while the training procedure of the generator is to maximize the probability of the discriminator making a mistake. The idea is that this adversarial process can eventually let the generator produce plausible and high-quality abstractive summaries. The authors provided supplementary material here. The source code is available in this github repo.
115 |
116 | In short
117 |
118 | Automatic text summarization is an exciting research area with several applications in industry. By condensing large quantities of information into a short form, summarization can aid many downstream applications such as creating news digests, report generation, news summarization, and headline generation. There are two prominent types of summarization algorithms.
119 |
120 | First, extractive summarization systems form summaries by copying and rearranging passages from the original text. Second, abstractive summarization systems generate new phrases, rephrasing or using words that were not in the original text. Due to the difficulty of abstractive summarization, the great majority of past work has been extractive.
121 |
122 | The extractive approach is easier because copying large chunks of text from the source document ensures good levels of grammaticality and accuracy. On the other hand, sophisticated abilities that are crucial to high-quality summarization, such as paraphrasing, generalization, or the incorporation of real-world knowledge, are possible only in an abstractive framework. Even though abstractive summarization is a more challenging task, there has been a number of advances so far, thanks to recent developments in the deep learning area.
123 |
124 | Cite as
125 | @misc{luisfredgs2020,
126 | title = "Automatic Text Summarization with Machine Learning — An overview",
127 | author = "Gonçalves, Luís",
128 | year = "2020",
129 | howpublished = {https://medium.com/luisfredgs/automatic-text-summarization-with-machine-learning-an-overview-68ded5717a25},
130 | }
131 | References
132 |
133 | 1. Extractive Text Summarization using Neural Networks — Sinha et al. (2018)
134 | 2. Extractive document summarization based on convolutional neural networks — Y. Zhang et al. (2016)
135 | 3. A Neural Attention Model for Abstractive Sentence Summarization — Rush et al. (2015)
136 | 4. Global Encoding for Abstractive Summarization — Lin et al. (2018)
137 | 5. Summarization with Pointer-Generator Networks — See et al. (2017)
138 | 6. Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting — Chen and Bansal (2018)
139 | 7. Generative Adversarial Network for Abstractive Text Summarization — Liu et al. (2017)
140 | 8. Using Latent Semantic Analysis in Text Summarization and Summary Evaluation — Steinberger and Jezek (2003)
141 | 9. Text summarization using Latent Semantic Analysis — Makbule et al. (2011)
142 |
143 | Endnote: English is not my native language, so let me know if you have found any errors in the text. I will be grateful if you can leave your feedback in the comments section. Also, leave a few claps if you found this text helpful!
144 |
145 | Paper Review
146 | NLP
147 | Text Summarization
148 | Deep Learning
149 | Machine Learning
150 |
154 |
--------------------------------------------------------------------------------