├── LangchainURLloader_error.jpg
├── logo.png
├── Headline.jpg
├── Headline-text.jpg
├── README.md
├── requirements.txt
├── my_data
│   ├── 2302.13971.pdf
│   ├── 2305.11206.pdf
│   ├── GPT4VsLima.txt
│   ├── extractive-vs-abstractive-summarization-in-healthcare.txt
│   ├── Derivative Tuning for PID Control.txt
│   ├── nlp-basics-abstractive-and-extractive-text-summarization.txt
│   └── BERTexplanation.txt
├── working
│   ├── youtubelogo.jpg
│   ├── yt.py
│   ├── video_summarization.txt
│   ├── testState.py
│   ├── dataf.py
│   ├── YoutubeSummarizer.py
│   ├── Mon_May_29_11-46-01_2023.svg
│   ├── video_transcript.txt
│   └── testst.py
├── koreanTranslation_bad.jpg
├── langChain_HFpipeline.jpg
├── ItalianTranslation_good.jpg
├── imageplaceholder1172x368.jpg
├── Langchain-API-modules-LLMs.jpg
├── huggingface-cli_scan-cache.jpg
├── summarization_index
│   ├── index.pkl
│   └── index.faiss
├── koreanTranslation_KOBART_bad.jpg
├── rich-readthedocs-io-en-stable.pdf
├── movedir.bat
├── fidnCachedModels.md
├── .gitignore
├── st-basic.py
├── filelist.txt
├── opus_en-ko_instructions.md
├── Model_kr_source.md
├── W9EP8sqMHDA-video_summarization.txt
├── JrapOij3Mtk-video_summarization.txt
├── st-headers-title-textarea-bt.py
├── on MediumRepo
│   ├── LaMini-TextSummarizer_mockup.py
│   ├── main.py
│   └── LaMini-TextSummarizer.py
├── main.py
├── video_summarization.txt
├── lamini.py
├── lamini-url.py
├── st-yourTranslationApp.py
├── W9EP8sqMHDA-video_transcript.txt
├── text_summarization.txt
├── pp.py
├── test-translation.py
├── 750x250.svg
├── LaMini-TextSummarizer.py
├── test-translation_en_to_kr.py
├── 1172x368.svg
├── video_transcript.txt
├── st-LaMini-YoutubeSummarizer.py
├── LaMini-YoutubeSummarizer.py
├── JrapOij3Mtk-video_transcript.txt
└── Installation_Instructions.md
/LangchainURLloader_error.jpg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/logo.png
--------------------------------------------------------------------------------
/Headline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/Headline.jpg
--------------------------------------------------------------------------------
/Headline-text.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/Headline-text.jpg
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # LaMini-Medium
2 | Repository with the code from the Medium article "LaMini-LM on your local PC".
3 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/requirements.txt
--------------------------------------------------------------------------------
/my_data/2302.13971.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/my_data/2302.13971.pdf
--------------------------------------------------------------------------------
/my_data/2305.11206.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/my_data/2305.11206.pdf
--------------------------------------------------------------------------------
/working/youtubelogo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/working/youtubelogo.jpg
--------------------------------------------------------------------------------
/koreanTranslation_bad.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/koreanTranslation_bad.jpg
--------------------------------------------------------------------------------
/langChain_HFpipeline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/langChain_HFpipeline.jpg
--------------------------------------------------------------------------------
/ItalianTranslation_good.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/ItalianTranslation_good.jpg
--------------------------------------------------------------------------------
/imageplaceholder1172x368.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/imageplaceholder1172x368.jpg
--------------------------------------------------------------------------------
/Langchain-API-modules-LLMs.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/Langchain-API-modules-LLMs.jpg
--------------------------------------------------------------------------------
/huggingface-cli_scan-cache.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/huggingface-cli_scan-cache.jpg
--------------------------------------------------------------------------------
/summarization_index/index.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/summarization_index/index.pkl
--------------------------------------------------------------------------------
/koreanTranslation_KOBART_bad.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/koreanTranslation_KOBART_bad.jpg
--------------------------------------------------------------------------------
/rich-readthedocs-io-en-stable.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/rich-readthedocs-io-en-stable.pdf
--------------------------------------------------------------------------------
/summarization_index/index.faiss:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/LaMini-Medium/master/summarization_index/index.faiss
--------------------------------------------------------------------------------
/movedir.bat:
--------------------------------------------------------------------------------
1 | move .gitattributes \model
2 | move .gitignore \model
3 | move README.md \model
4 | move config.json \model
5 | move generation_config.json \model
6 | move pytorch_model.bin \model
7 | move special_tokens_map.json \model
8 | move spiece.model \model
9 | move tokenizer.json \model
10 | move tokenizer_config.json \model
11 | move training_args.bin \model
12 |
--------------------------------------------------------------------------------
/fidnCachedModels.md:
--------------------------------------------------------------------------------
1 | ```
2 | (venv) PS C:\Users\fmatricard\Videos\LaMiniLocal> huggingface-cli scan-cache
3 | REPO ID REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH
4 |
5 | -------------------------- --------- ------------ -------- -------------- -------------- ---- ------------------------------------------------------------------------------
6 | Helsinki-NLP/opus-mt-en-it model 346.9M 7 18 minutes ago 18 minutes ago main C:\Users\fmatricard\.cache\huggingface\hub\models--Helsinki-NLP--opus-mt-en-it
7 |
8 | Done in 0.0s. Scanned 1 repo(s) for a total of 346.9M.
9 | Got 2 warning(s) while scanning. Use -vvv to print details.
10 | ```
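11 | 
12 | The same information can also be read programmatically. A minimal sketch, assuming a recent `huggingface_hub` release that exposes `scan_cache_dir`:
13 | ```
14 | from huggingface_hub import scan_cache_dir
15 | 
16 | cache_info = scan_cache_dir()
17 | for repo in cache_info.repos:
18 |     # repo_id, repo_type, size_on_disk and repo_path mirror the CLI columns above
19 |     print(repo.repo_id, repo.repo_type, repo.size_on_disk, repo.repo_path)
20 | print(f"Scanned {len(cache_info.repos)} repo(s), total {cache_info.size_on_disk} bytes on disk")
21 | ```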
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Compiled source #
2 | ###################
3 | *.com
4 | *.class
5 | *.dll
6 | *.exe
7 | *.o
8 | *.so
9 |
10 | # Packages #
11 | ############
12 | # it's better to unpack these files and commit the raw source
13 | # git has its own built in compression methods
14 | *.7z
15 | *.dmg
16 | *.gz
17 | *.iso
18 | *.jar
19 | *.rar
20 | *.tar
21 | *.zip
22 |
23 | # Logs and databases #
24 | ######################
25 | *.log
26 | *.sql
27 | *.sqlite
28 |
29 | # OS generated files #
30 | ######################
31 | .DS_Store
32 | .DS_Store?
33 | ._*
34 | .Spotlight-V100
35 | .Trashes
36 | ehthumbs.db
37 | Thumbs.db
38 |
39 | # virtual env directory #
40 | #########################
41 | /venv/
42 | /model/
43 | /model_kr/
44 | /opus-en-ko/
45 | /model_it/
--------------------------------------------------------------------------------
/st-basic.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | ############# Displaying images on the front end #################
3 | st.set_page_config(page_title="Mockup for single page webapp",
4 | page_icon='💻',
5 | layout="centered", #or wide
6 | initial_sidebar_state="expanded",
7 | menu_items={
8 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
9 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
10 | 'About': "# This is a header. This is an *extremely* cool app!"}
11 | )
12 | # Load image placeholder from the web
13 | st.image('https://placehold.co/1172x368', width=750)
14 | st.divider()
15 | # load image from local disk
16 | st.image('Headline-text.jpg', width=750)
17 | st.divider()
18 | st.image('750x250.svg', width=750)
--------------------------------------------------------------------------------
/filelist.txt:
--------------------------------------------------------------------------------
1 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/.gitattributes
2 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/.gitignore
3 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/README.md
4 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/config.json
5 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/generation_config.json
6 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/pytorch_model.bin
7 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/special_tokens_map.json
8 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/spiece.model
9 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/tokenizer.json
10 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/tokenizer_config.json
11 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/training_args.bin
--------------------------------------------------------------------------------
/opus_en-ko_instructions.md:
--------------------------------------------------------------------------------
1 | # MODEL FOR KOREAN TRANSLATIONS
2 | https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-ko/tree/main
3 | repo_id = 'Helsinki-NLP/opus-mt-tc-big-en-ko'
4 |
5 | ### Use in Transformers
6 | ```python
7 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
8 |
9 | tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-en-ko")
10 |
11 | model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-tc-big-en-ko")
12 | ```
13 |
14 |
15 | ### MODEL FOR KOREAN TRANSLATIONS AND SUMMARIZATION
16 | Tested: the output quality was bad (see koreanTranslation_KOBART_bad.jpg in this repo).
17 | repo_id = 'gogamza/kobart-base-v2'
18 | ---
19 | files and model card:
20 | https://huggingface.co/gogamza/kobart-base-v2/tree/main
21 |
22 | GitHub repo with examples: https://github.com/haven-jeon/KoBART
23 |
24 | KoBart translations Inference
25 | https://github.com/seujung/KoBART-translation/blob/main/infer.py
26 | KoBart SUMMARIZATION
27 | https://github.com/seujung/KoBART-summarization
28 |
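29 | ### Quick usage sketch (English to Korean)
30 | A minimal, unverified sketch of running the `Helsinki-NLP/opus-mt-tc-big-en-ko` model from the first section through a translation pipeline; the sample sentence is only an illustration:
31 | ```python
32 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
33 | 
34 | repo_id = "Helsinki-NLP/opus-mt-tc-big-en-ko"
35 | tokenizer = AutoTokenizer.from_pretrained(repo_id)
36 | model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
37 | 
38 | # pair model and tokenizer in a translation pipeline, as done in the other scripts of this repo
39 | en_to_ko = pipeline("translation", model=model, tokenizer=tokenizer)
40 | print(en_to_ko("How are you today?")[0]["translation_text"])
41 | ```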
--------------------------------------------------------------------------------
/Model_kr_source.md:
--------------------------------------------------------------------------------
1 | # Korean model
2 | This is used ONLY for translations from English to Korean (Hangul)
3 |
4 | source: https://huggingface.co/hcho22/opus-mt-ko-en-finetuned-en-to-kr
5 | repo_id = "hcho22/opus-mt-ko-en-finetuned-en-to-kr"
6 |
7 | NOTE
8 | TensorFlow is required
9 | the model weights are stored as a TensorFlow .h5 checkpoint
10 | ```
11 | pip install tensorflow
12 | Successfully installed absl-py-1.4.0 astunparse-1.6.3 flatbuffers-23.5.26 gast-0.4.0 google-auth-2.19.1 google-auth-oauthlib-1.0.0 google-pasta-0.2.0 grpcio-1.54.2 h5py-3.8.0 jax-0.4.11 keras-2.12.0 libclang-16.0.0 ml-dtypes-0.1.0 oauthlib-3.2.2 opt-einsum-3.3.0 pyasn1-0.5.0 pyasn1-modules-0.3.0 requests-oauthlib-1.3.1 rsa-4.9 tensorboard-2.12.3 tensorboard-data-server-0.7.0 tensorflow-2.12.0 tensorflow-estimator-2.12.0 tensorflow-intel-2.12.0 tensorflow-io-gcs-filesystem-0.31.0 termcolor-2.3.0 urllib3-1.26.16 werkzeug-2.3.4 wheel-0.40.0
13 | ```
14 | Include `from_tf=True` when loading the model:
15 | ```
16 | model_ttKR = AutoModelForSeq2SeqLM.from_pretrained(Model_KR, from_tf=True)
17 | ```
18 |
19 | Some test examples:
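20 | 
21 | A minimal sketch, assuming the weights were downloaded into a local `./model_kr/` folder (the directory name listed in `.gitignore`); the sample sentence is only an illustration:
22 | ```
23 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
24 | 
25 | Model_KR = "./model_kr/"  # assumption: locally downloaded checkpoint
26 | tokenizer_KR = AutoTokenizer.from_pretrained(Model_KR)
27 | # from_tf=True because the published weights are a TensorFlow .h5 checkpoint
28 | model_ttKR = AutoModelForSeq2SeqLM.from_pretrained(Model_KR, from_tf=True)
29 | 
30 | en_to_kr = pipeline("translation", model=model_ttKR, tokenizer=tokenizer_KR)
31 | print(en_to_kr("The weather is nice today.")[0]["translation_text"])
32 | ```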
--------------------------------------------------------------------------------
/working/yt.py:
--------------------------------------------------------------------------------
1 | from pytube import YouTube as YT
2 | import ssl
3 | import datetime
4 |
5 | ########### SSL FOR PROXY ##############
6 | ssl._create_default_https_context = ssl._create_unverified_context
7 |
8 | myvideo = YT('https://youtu.be/riXpu1tHzl0', use_oauth=True, allow_oauth_cache=True)
9 | # required only the first time, to see which caption languages are available
10 | print(myvideo.title)
11 | print(myvideo.captions) #print the options of languages available
12 | #Commented out so the auto-generated captions are chosen automatically
13 | #code = input("input the code you want: ") #original
14 | print("Scraping subtitles...")
15 | #Commented to test the auto-generated ones
16 | #sub = myvideo.captions[code] #original
17 | print(myvideo.captions)
18 | sub = myvideo.captions['a.en']
19 | print(sub)
20 | caption = sub.generate_srt_captions()
21 | import datetime
22 | m1 = f"TITLE: {myvideo.title}"+'\n'
23 | m2 = f"thumbnail url: {myvideo.thumbnail_url}"+'\n'
24 | m4 = f"video Duration: {str(datetime.timedelta(seconds=myvideo.length))}"+'\n'
25 | m5 = "----------------------------------------"+'\n'
26 | #m6 = textwrap.fill(myvideo.description, 80)+'\n' #solution not good
27 | m6 = myvideo.description+'\n'
28 | m7 = "----------------------------------------"+'\n'
29 |
--------------------------------------------------------------------------------
/W9EP8sqMHDA-video_summarization.txt:
--------------------------------------------------------------------------------
1 | The video titled "Opensource Supremacy? | OpenLLAMA: The Revenge of Opensource?" has a duration of 0:05:17. The article discusses the release of Open Llama, an open source reproduction of Meta AI's popular llama large language models. It is permissively licensed and is being developed by researchers at Berkeley AI research to understand why this work is important. The model architecture is new and has a longer context length of at least 1496 tokens, beating the state of the art on benchmarks. However, the training data is only about 20 percent done, which is only one-fifth of the total for the 7 billion parameter model. The preliminary evaluation is promising, but further conclusions can only be temporary. The training for 200 billion parameters seems almost horizontal, but it is likely an artifact of the plotting scale or something else. The future looks bright as seen on the chat language model. The line appears almost horizontal with 200 billion parameters, similar to a loss curve in the Llama paper. Further training is possible beyond the 1.4 trillion tokens. The future of Open Llamas looks bright, with the most popular chat language models highlighted in yellow. The author is looking forward to open Llama becoming a free alternative model for these models and releasing their research code and weights to the world. Thank you.
--------------------------------------------------------------------------------
/working/video_summarization.txt:
--------------------------------------------------------------------------------
1 | The Meta AI LIMA video is GroundBreaking!!! with a duration of 0:12:24. Facebook's AI paper called Lima is based on the Lama language model and is fine-tuned with a standard supervised loss to better align to the end tasks and user preference. The model is carefully curated and can perform complex queries without any reinforcement learning or human preference modeling. This is a significant part of the paper as a model's knowledge and capabilities are learned entirely. The paper discusses the importance of pre-training and alignment in language models, as well as the comparison of Lima's language model to GPT4 and Darwin's C 003. The paper also mentions the use of community question and answering Wiki data, push shift rated data set, and human evaluation and preference evaluation comparison. The results of the study are interesting and provide information about how to move past a crush and find healthy ways to cope. The paper discusses how Lima's response to a crush on a guy in a serious relationship can provide valuable information about how to break up and cope with feelings. It also highlights the mental effort involved in constructing such examples and how it can be difficult to scale up. Additionally, the paper highlights the potential of tackling the complex issue of alignment with a simple approach. The paper concludes that Lima less is more for alignment. The evidence presented in this paper shows the potential of tackling the complex issue of alignment with a simple approach. The paper suggests that Lima less is more for alignment compared to large data sets, fine tuning like alpacas, and enhancing the model with 52 000 instructions. Thank you for sharing this detail. I will link this paper in the YouTube description. I would love to hear your thoughts on this paper.
--------------------------------------------------------------------------------
/JrapOij3Mtk-video_summarization.txt:
--------------------------------------------------------------------------------
1 | The Meta AI LIMA video is GroundBreaking!!! with a duration of 0:12:24. Facebook's AI paper called Lima is based on the Lama language model and is fine-tuned with a standard supervised loss to better align to the end tasks and user preference. The model is carefully curated and can perform complex queries without any reinforcement learning or human preference modeling. This is a significant part of the paper as a model's knowledge and capabilities are learned entirely. The paper discusses the importance of pre-training and alignment in language models, as well as the comparison of Lima's language model to GPT4 and Darwin's C 003. The paper also mentions the use of community question and answering Wiki data, push shift rated data set, and human evaluation and preference evaluation comparison. The results of the study are interesting and provide information about how to move past a crush and find healthy ways to cope. The paper discusses how Lima's response to a crush on a guy in a serious relationship can provide valuable information about how to break up and cope with feelings. It also highlights the mental effort involved in constructing such examples and how it can be difficult to scale up. Additionally, the paper highlights the potential of tackling the complex issue of alignment with a simple approach. The paper concludes that Lima less is more for alignment. The evidence presented in this paper shows the potential of tackling the complex issue of alignment with a simple approach. The paper suggests that Lima less is more for alignment compared to large data sets, fine tuning like alpacas, and enhancing the model with 52 000 instructions. Thank you for sharing this detail. I will link this paper in the YouTube description. I would love to hear your thoughts on this paper.
--------------------------------------------------------------------------------
/st-headers-title-textarea-bt.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | ############# Displaying images on the front end #################
3 | st.set_page_config(page_title="Mockup for single page webapp",
4 | page_icon='💻',
5 | layout="centered", #or wide
6 | initial_sidebar_state="expanded",
7 | menu_items={
8 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
9 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
10 | 'About': "# This is a header. This is an *extremely* cool app!"}
11 | )
12 | # Load image placeholder from the web
13 | st.image('https://placehold.co/750x150', width=750)
14 | # Set a Descriptive Title
15 | st.title("Your Beautiful App Name")
16 | st.divider()
17 | your_future_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras rhoncus massa sit amet est congue dapibus. Duis dictum ac nulla sit amet sollicitudin. In non metus ac neque vehicula egestas. Vestibulum quis justo id enim vestibulum venenatis. Cras gravida ex vitae dignissim suscipit. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis efficitur, lorem ut fringilla commodo, lacus orci lobortis turpis, sit amet consequat ante diam ut libero."
18 | st.text_area('Summarized text', your_future_text, height = 150, key = 'result')
19 |
20 | #col1, col0, col2 = st.columns(3) #for 3 columns even distribution
21 | col1, col2 = st.columns(2)
22 | btn1 = col1.button(" :star: Click ME ", use_container_width=True, type="secondary")
23 | btn2 = col2.button(" :smile: Click ME ", use_container_width=True, type="primary")
24 |
25 | if btn1:
26 | st.warning('You pressed the wrong one!', icon="⚠️")
27 | if btn2:
28 | st.success('Good Choice!', icon="⚠️")
29 | st.divider()
--------------------------------------------------------------------------------
/on MediumRepo/LaMini-TextSummarizer_mockup.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | ############# Displaying images on the front end #################
3 | st.set_page_config(page_title="Mockup for single page webapp",
4 | page_icon='💻',
5 | layout="centered", #or wide
6 | initial_sidebar_state="expanded",
7 | menu_items={
8 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
9 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
10 | 'About': "# This is a header. This is an *extremely* cool app!"}
11 | )
12 | # Load image placeholder from the web
13 | st.image('https://placehold.co/750x150', width=750)
14 | # Set a Descriptive Title
15 | st.title("Your Beautiful App Name")
16 | st.divider()
17 | your_future_text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras rhoncus massa sit amet est congue dapibus. Duis dictum ac nulla sit amet sollicitudin. In non metus ac neque vehicula egestas. Vestibulum quis justo id enim vestibulum venenatis. Cras gravida ex vitae dignissim suscipit. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Duis efficitur, lorem ut fringilla commodo, lacus orci lobortis turpis, sit amet consequat ante diam ut libero."
18 | st.text_area('Summarized text', your_future_text, height = 150, key = 'result')
19 |
20 | #col1, col0, col2 = st.columns(3) #for 3 columns even distribution
21 | col1, col2 = st.columns(2)
22 | btn1 = col1.button(" :star: Click ME ", use_container_width=True, type="secondary")
23 | btn2 = col2.button(" :smile: Click ME ", use_container_width=True, type="primary")
24 |
25 | if btn1:
26 | st.warning('You pressed the wrong one!', icon="⚠️")
27 | if btn2:
28 | st.success('Good Choice!', icon="⚠️")
29 | st.divider()
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
2 | from transformers import pipeline
3 | import torch
4 |
5 |
6 | #############################################################################
7 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
8 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
9 | # ###########################################################################
10 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
11 |
12 | from rich import console
13 | from rich.console import Console
14 | from rich.panel import Panel
15 | from rich.text import Text
16 | from functools import reduce
17 | from itertools import chain
18 | from datetime import datetime
19 |
20 | console = Console()
21 |
22 | console.print("[bold yellow]Preparing the LaMini Model...")
23 | tokenizer = AutoTokenizer.from_pretrained(checkpoint)
24 | base_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
25 | device_map='auto',
26 | torch_dtype=torch.float32)
27 |
28 | pipe = pipeline('text2text-generation',
29 | model = base_model,
30 | tokenizer = tokenizer,
31 | max_length = 512,
32 | do_sample=True,
33 | temperature=0.3,
34 | top_p=0.95,
35 | )
36 |
37 |
38 | """### The prompt & response"""
39 |
40 | import textwrap
41 | response = ''
42 | instruction = console.input("Ask LaMini: ")
43 | start = datetime.now()
44 | console.print("[red blink]Executing...")
45 | console.print(f"[grey78]Generating answer to your question:[grey78] [green_yellow]{instruction}")
46 | #instruction = 'Write a travel blog about a 3-day trip to The Philippines'
47 | generated_text = pipe(instruction)
48 | for text in generated_text:
49 | response += text['generated_text']
50 | wrapped_text = textwrap.fill(response, 100)
51 | console.print(Panel(wrapped_text, title="LaMini Reply", title_align="center"))
52 | stop = datetime.now()
53 | elapsed = stop - start
54 | console.rule(f"Report Generated in {elapsed}")
55 | console.print(f"LaMini @ {datetime.now().ctime()}")
--------------------------------------------------------------------------------
/video_summarization.txt:
--------------------------------------------------------------------------------
1 | The video discusses the reason why Toyota refuses to switch to EVs, with the title "Genius Reason Why Toyota Refuses to Switch To EVS!" and the duration of the video is 0:08:38. Toyota has refused to make the switch to EVs due to the inconsistency in the argument led by Tesla and other EV experts. They want to focus on other propulsion methods such as hydrogen hybrid and internal combustion engines or ice. The reason behind this decision is based on Toyota's understanding of the auto market and the fact that they have planned for the next 20-50 and 100 years, not the next five or ten. Toyota's strategy is to position itself as the go-to option for people who still want to feel the powerful Roar of a gas-powered engine and take a drive without having to worry about charging in range. They believe that EV adoption is not as rosy as we have been made to believe, and that Toyota has developed another type of fuel that could power vehicles and result in cleaner and more efficient cars. They have invested heavily in developing hybrid powertrains and have achieved significant success with them. Toyota may see hybrid technology as a more practical and proven option with existing manufacturing capabilities and infrastructure, and may choose to focus on improving and expanding its hybrid lineup rather than fully transitioning to EVS. Technical challenges, such as battery technology charging times and range limitations, are one of the reasons why Toyota has refused to make the complete switch to electric vehicles. Toyota struggled with its first and only all-electric vehicle, the BZ-4X, due to issues with Hub bolts on the wheels coming loose in the event of hard braking or a sharp turn. The company had to recall and stop the sale of this EV SUV in June 2022 due to Hub bolt issues, and it is possible that Toyota may be working to overcome these challenges or waiting for further advancements in technology for building EVS before committing to large-scale EV production. The reasons mentioned in this video are speculative and may not reflect the current or complete reasoning behind Toyota's decision-making process. Toyota is a large, arguably the largest automaker, and automaker strategies and decisions are influenced by a multitude of factors including market research, consumer preferences, technological advancements, regulatory policies, and economic consideration among others.
--------------------------------------------------------------------------------
/lamini.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
2 | from transformers import pipeline
3 | import torch
4 | from rich import console
5 | from rich.console import Console
6 | from rich.panel import Panel
7 | from rich.text import Text
8 | from rich.theme import Theme
9 | from functools import reduce
10 | from itertools import chain
11 | from datetime import datetime
12 |
13 | #############################################################################
14 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
15 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
16 | # ###########################################################################
17 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
18 |
19 | console = Console(record=True)
20 |
21 | from rich.console import Console
22 | from rich.terminal_theme import MONOKAI
23 |
24 |
25 | console.print("[bold yellow]Preparing the LaMini Model...")
26 | tokenizer = AutoTokenizer.from_pretrained(checkpoint)
27 | base_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
28 | device_map='auto',
29 | torch_dtype=torch.float32)
30 |
31 | pipe = pipeline('text2text-generation',
32 | model = base_model,
33 | tokenizer = tokenizer,
34 | max_length = 512,
35 | do_sample=True,
36 | temperature=0.3,
37 | top_p=0.95,
38 | )
39 |
40 |
41 | """### The prompt & response"""
42 |
43 | import textwrap
44 | response = ''
45 | instruction = console.input("Ask FabioMini: ")
46 | start = datetime.now()
47 | console.print("[red blink]Executing...")
48 | console.print(f"[grey78]Generating answer to your question:[grey78] [green_yellow]{instruction}")
49 | #instruction = 'Write a travel blog about a 3-day trip to The Philippines'
50 | generated_text = pipe(instruction)
51 | for text in generated_text:
52 | response += text['generated_text']
53 | wrapped_text = textwrap.fill(response, 100)
54 | console.print(Panel(wrapped_text, title="FabioMini Reply", title_align="center"))
55 | stop = datetime.now()
56 | elapsed = stop - start
57 | console.rule(f"Report Generated in {elapsed}")
58 | console.print(f"LaMini @ {datetime.now().ctime()}")
59 |
60 | def fix_filename(text):
61 | text = text.replace(' ','_')
62 | text = text.replace(':','-')
63 | return f"{text}.svg"
64 |
65 | #console.save_svg("example.svg", theme=MONOKAI)
66 | console.save_svg(fix_filename(datetime.now().ctime()), theme=MONOKAI)
--------------------------------------------------------------------------------
/lamini-url.py:
--------------------------------------------------------------------------------
1 | from langchain.document_loaders import UnstructuredURLLoader
2 | # requires tabulate: pip install tabulate
3 | # also requires libmagic: pip install libmagic
4 | import ssl
5 |
6 | #if error see comments at the end
7 |
8 | ssl._create_default_https_context = ssl._create_unverified_context
9 |
10 | urls = [
11 | "https://keras.io/examples/nlp/t5_hf_summarization/",
12 | "https://blog.futuresmart.ai/summarizing-documents-made-easy-with-langchain-summarizer"
13 | ]
14 |
15 | loader = UnstructuredURLLoader(urls=urls)
16 | data = loader.load()
17 | print("unstructerLoader...")
18 | print("*"*50)
19 | print(data)
20 |
21 | # WebBaseLoader requires `pip install bs4`
22 | from langchain.document_loaders import WebBaseLoader
23 | loader = WebBaseLoader(urls[0])
24 | data2 = loader.load()
25 | print("WbeBaseloader...")
26 | print("*"*50)
27 | print(data2)
28 |
29 | #Loading multiple webpages
30 | #You can also load multiple webpages at once by passing in a list of urls to the loader.
31 | #This will return a list of documents in the same order as the urls passed in.
32 |
33 | loader = WebBaseLoader(["https://www.espn.com/", "https://google.com"])
34 | docs = loader.load()
35 | print(docs)
36 |
37 | """
38 | SOURCE https://stackoverflow.com/questions/51925384/unable-to-get-local-issuer-certificate-when-using-requests-in-python
39 | ---
40 | Error fetching or processing https://blog.futuresmart.ai/summarizing-documents-made-easy-with-langchain-summarizer, exeption: HTTPSConnectionPool(host='blog.futuresmart.ai', port=443): Max retries exceeded with url: /summarizing-documents-made-easy-with-langchain-summarizer (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
41 | 1. To check whether you have certificates:
42 | python -c "import ssl; print(ssl.get_default_verify_paths())"
43 |
44 | if empty run python
45 | >>> import certifi
46 | >>> certifi.where()
47 | 'C:\\Users\\fmatricard\\Videos\\LaMiniLocal\\venv\\lib\\site-packages\\certifi\\cacert.pem'
48 |
49 | save the certificate from every url: click on the padlock icon, open the details, and export the BASE64-encoded .cer file
50 |
51 | open venv\\lib\\site-packages\\certifi\\cacert.pem and append the entire certificate between the 2 tags...
52 | (---Begin Certificate--- *** ---End Certificate---)
53 |
54 |
55 | ALTERNATIVE
56 |
57 | For those who this problem persists: - Python 3.6 (some other versions too?) on MacOS comes with its own private copy of OpenSSL. That means the trust certificates in the system are no longer used as defaults by the Python ssl module. To fix that, you need to install a certifi package in your system.
58 |
59 | You may try to do it in two ways:
60 |
61 | 1) Via PIP:
62 |
63 | pip install --upgrade certifi
64 | 2) If it doesn't work, try to run a Certificates.command that comes bundled with Python 3.6 for Mac:
65 |
66 | open /Applications/Python\ 3.6/Install\ Certificates.command
67 | One way or another, you should now have certificates installed, and Python should be able to connect via HTTPS without any issues.
68 |
69 | """
--------------------------------------------------------------------------------
/my_data/GPT4VsLima.txt:
--------------------------------------------------------------------------------
1 | Title: GPT-4 vs LIMA: Rethinking Large Language Models for Efficiency and Performance | by Amir Shakiba | May, 2023 | Medium
2 |
3 | GPT-4 vs LIMA: Rethinking Large Language Models for Efficiency and Performance
4 | written by Amir Shakiba
5 |
6 |
7 | A recent paper by Meta AI has the potential to revolutionize our understanding of large language models (LLMs).
8 |
9 | To delve into their workings, let’s take a closer look at Meta AI’s LLAMA model. (you can jump straight to LIMA if you want)
10 |
11 | LLAMA
12 |
13 | LLMs, which are trained on vast amounts of text, have given us impressive results. Initially, it was believed that bigger models are necessary for better performance. However, recent papers suggest that smaller models trained on more data can actually deliver better results, challenging the notion of model size. Importantly, practical considerations come into play. In terms of production efficiency, it is more advantageous to train a smaller model for a longer duration, rather than opting for a larger model trained in a shorter time frame that requires more GPU resources during inference.
14 |
15 | smaller models on larger datasets means less cost and more affordability leading to democratization of AI which OpenAI is really concerned about!
16 |
17 | This is where LLAMA models come in. Despite having fewer parameters compared to GPT-3 models, LLAMA models can run on a single GPU. Additionally, LLAMA models are exclusively trained on openly accessible datasets, in contrast to other systems like ChatGPT, which rely on data that is not publicly available.(openAI or closeAI?:)
18 |
19 | LIMA
20 |
21 | Now let’s shift our focus to LIMA, a new LLAMA model developed by Meta AI. LLMs undergo two distinct stages of training. Firstly, they are trained on massive amounts of data to acquire general-purpose representations. Secondly, instruction tuning or reinforcement learning is employed to guide the model for specific tasks. Notably, reinforcement learning from human feedback (RLHF) has been championed by OpenAI as a crucial aspect of training models like ChatGPT. However, this new study suggests that RLHF has limited impact on training. The majority of learning occurs during pretraining and training on the massive text corpus.
22 |
23 | If humans are not even needed for their feedback ,what makes them useful?
24 |
25 | LIMA, a 65B-parameter LLAMA model, stands out as it is trained on only 1,000 precise prompts. Remarkably, LIMA achieves competitive results comparable to GPT-4, Claude, or Bard. This highlights the power of pretraining and diminishes the significance of large-scale instruction tuning and reinforcement learning approaches.
26 |
27 | In summary, Meta AI’s research sheds light on the potential of LLAMA models and challenges the conventional understanding of LLMs. The focus on training smaller models on larger datasets and the limited role of reinforcement learning in training highlight the efficiency and effectiveness of this approach. LIMA exemplifies the promising capabilities of LLAMA models and their ability to achieve impressive performance with significantly fewer parameters.
28 |
29 | our understanding of billion parameter world is too little , there’s more to discover.
30 |
31 | link to the papers:
32 |
33 | LLAMA: https://arxiv.org/pdf/2302.13971.pdf
34 | LIMA : https://arxiv.org/pdf/2305.11206.pdf
--------------------------------------------------------------------------------
/st-yourTranslationApp.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | import torch
3 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
4 | from transformers import pipeline
5 | from langchain.text_splitter import CharacterTextSplitter
6 | import datetime
7 | import os
8 | import sys
9 |
10 | ############# Displaying images on the front end #################
11 | st.set_page_config(page_title="Your AI translation App",
12 | page_icon='♾️',
13 | layout="centered", #or wide
14 | initial_sidebar_state="expanded",
15 | menu_items={
16 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
17 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
18 | 'About': "# This is a header. This is an *extremely* cool app!"
19 | },
20 | )
21 |
22 | # 🈚🆗✅💬🇮🇹🇺🇸
23 | #LOCAL MODEL EN-IT
24 | #---------------------------------
25 | # Helsinki-NLP/opus-mt-en-it
26 | Model_IT = './model_it/' #torch
27 | #---------------------------------
28 |
29 | ### HEADER section
30 | st.title("Your AI powered Text Translator 💬 ")
31 | st.header("Translate your English text to Italian")
32 | #st.image('Headline.jpg', width=750)
33 | English = st.text_area("Paste here the English text...", height=300, key="original")
34 | col1, col2, col3 = st.columns([2,5,2])
35 | btn_translate = col2.button("✅ Start Translation", use_container_width=True, type="primary", key='start')
36 | if btn_translate:
37 | if English:
38 | Model_IT = './model_it/' #torch
39 | with st.spinner('Initializing pipelines...'):
40 | st.success(' AI Translation started', icon="🆗")
41 | from langchain.text_splitter import CharacterTextSplitter
42 | # TEXT SPLITTER FUNCTION FOR CHUNKING
43 | text_splitter = CharacterTextSplitter(
44 | separator = "\n\n",
45 | chunk_size = 300,
46 | chunk_overlap = 0,
47 | length_function = len,
48 | )
49 | # CHUNK THE DOCUMENT
50 | st.success(' Chunking text...', icon="🆗")
51 | texts = text_splitter.create_documents([English])
52 | #print('[bold red] Initialize AI tokenizer...')
53 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
54 | # INITIALIZE TRANSLATION FROM ENGLISH TO ITALIAN
55 | tokenizer_tt0it = AutoTokenizer.from_pretrained(Model_IT) #google/byt5-small #facebook/m2m100_418M
56 | st.success(' Initializing AI Model & pipeline...', icon="🆗")
57 | model_tt0it = AutoModelForSeq2SeqLM.from_pretrained(Model_IT) #Helsinki-NLP/opus-mt-en-it or #Helsinki-NLP/opus-mt-it-en
58 | #print("pipeline")
59 | TToIT = pipeline("translation", model=model_tt0it, tokenizer=tokenizer_tt0it)
60 | # ITERATE OVER CHUNKS AND JOIN THE TRANSLATIONS
61 | finaltext = ''
62 | start = datetime.datetime.now() #not used now but useful
63 | print('Translation in progress...')
64 | for item in texts:
65 | line = TToIT(item.page_content)[0]['translation_text']
66 | finaltext = finaltext+line+'\n'
67 | stop = datetime.datetime.now() #not used now but useful
68 | elapsed = stop - start
69 | st.success(f'Translation completed in {elapsed}', icon="🆗")
70 | print(f'Translation generated in {elapsed}...')
71 | st.text_area(label="Translated text in Italian:", value=finaltext, height=350)
72 | st.markdown(f'Translation completed in **{elapsed}**')
73 | st.markdown(f"Translated number **{len(English.split(' '))}** of words")
74 |
75 | else:
76 | st.warning("You need some text to be translated!", icon="⚠️")
--------------------------------------------------------------------------------
/working/testState.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | import pandas as pd
4 | from io import StringIO
5 | from PIL import Image
6 | import ssl
7 |
8 |
9 | ############# Displaying images on the front end #################
10 | st.set_page_config(page_title="Summarize and Talk ot your Documents",
11 | page_icon='📖',
12 | layout="centered", #or wide
13 | initial_sidebar_state="expanded",
14 | menu_items={
15 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
16 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
17 | 'About': "# This is a header. This is an *extremely* cool app!"
18 | },
19 | )
20 | ########### SSL FOR PROXY ##############
21 | ssl._create_default_https_context = ssl._create_unverified_context
22 |
23 | #### IMPORTS FOR AI PIPELINES ###############
24 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
25 | from transformers import pipeline
26 |
27 | from transformers import AutoModel, T5Tokenizer, T5Model
28 | from transformers import T5ForConditionalGeneration
29 | from langchain.llms import HuggingFacePipeline
30 | import torch
31 | #from functools import reduce #for highlighter
32 | #from itertools import chain #for highlighter
33 | import datetime
34 |
35 | #############################################################################
36 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
37 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
38 | # ###########################################################################
39 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
40 | LaMini = './model/'
41 |
42 |
43 | ######################################################################
44 | # SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE #
45 | ######################################################################
46 |
47 | with open('video_transcript.txt') as f:
48 | testo = f.read()
49 | f.close()
50 |
51 | global video_summary
52 | global processed
53 | processed = False
54 | #with st.columns(3)[1]:
55 | # st.header("hello world")
56 | # st.image("http://placekitten.com/200/200")
57 | #st.title('🏞️ Image display methods')
58 | #
59 | #st.write("a logo and text next to eachother")
60 | #
61 |
62 | ### HEADER section
63 | col1, col2, col3 = st.columns([1,20, 1])
64 | col2.image('youtubelogo.jpg', width=180)
65 | col2.title('AI Summarizer')
66 |
67 | #image_path = 'logo.png'
68 | #image = Image.open(image_path)
69 | #st.image('https://streamlit.io/images/brand/streamlit-mark-light.png')
70 | #st.image(image_path, width = 700)
71 |
72 |
73 | title = st.text_input('1. Input your Youtube Video url', 'Something like https://youtu.be/SCYMLHB7cfY....', key='inputurl') #https://youtu.be/SCYMLHB7cfY
74 | videotitle = st.empty()
75 | st.write(st.session_state.inputurl)
76 |
77 | txt = st.empty()
78 | video_dur = st.empty()
79 | video_redux = st.empty()
80 | st.divider()
81 |
82 |
83 |
84 |
85 | txt.text_area('Summarized text', testo, height = 450, key = 'result')
86 | def putto():
87 | st.session_state.result = testo
88 | st.write(st.session_state.result)
89 | video_redux.markdown(f"Percentage of reduction: 15 {len(testo[:350].split(' '))}/{len(testo.split(' '))} words")
90 | if st.button('2. Start Summarization', on_click=putto):
91 | processed = True #status flag for the download button
92 | print(processed)
93 | st.write(st.session_state.result)
94 |
95 |
96 | if st.button('3. Download Summarization'):
97 | with open('putto.txt', 'w') as f:
98 | f.write(st.session_state.result)
99 | f.close()
100 | st.markdown(f"## Download your YouTube Video Summarization")
101 | st.success('AI Summarization saved in video_summarization.txt', icon="✅")
102 |
103 |
104 |
105 |
--------------------------------------------------------------------------------
/W9EP8sqMHDA-video_transcript.txt:
--------------------------------------------------------------------------------
1 | TITLE: Opensource Supremacy? | OpenLLAMA: The Revenge of Opensource?
2 | video Duration: 0:05:17
3 | ----------------------------------------
4 | this is open llama it is an open source reproduction of meta ai's popular llama large language models unlike the original this release is also permissively licensed it is being developed by these two researchers at Berkeley AI research to understand why this work is important we need to travel back in time to the old days of AI to precisely the 25th of February 2023 a little over two months ago it was a good day and the unlikun the true ring price winner and AI honcho at meta AI sent this tweet announcing to the world the release of the Llama large language models the research paper and the code then he added these two more tweets for context and the a world was transformed it was now clear that he was talking about a game changer here's why this were the first models to train on trillions of tokens using only openly available data the model architecture was also new they had longer context length of at least 1496 tokens as shown in this table they clearly beat the state of the art on this standard performance benchmarks llama 32 billion parameter outperformed GPT 3 on most benchmarks despite being 10 times smaller llamas 65 billion parameter is also competitive with the state-of-the-art such as chinchilla with 70 billion parameter and palm with 540 billion parameter and the training loss curves showed that the models could still improve by trailing beyond the 1.4 trillion tokens some people reading the Tweet saw just one problem it wasn't the data all the language models themselves no it was the license which did not permit commercial use as shown here you also had to request access to get the weight so these conditions rubbed a lot of none researches the wrong way and for a while Jan lacun's Twitter streams looked like the hotline for disgruntled teenagers and so with the fullness of time and a glim of Hope emerged across the Atlantic in Europe a Consortium of AI researchers known as together had been doing open source the traditional way and now they focused on reproducing the Llama data set they carefully prepare the source data cleaned it and closely replicated the reported llama data set and on April 17 2023 they released 1.2 trillion token deficit as red pajama this is how it Compares with the Llama data set and this brings us back to the present to open llama the researchers at Berkeley AI research have taken the red pajama dataset and are in the process of reproducing the published research on the Llama large language models they have written that they are using the same architecture the same pre-processing steps and the same hyper parameters only the training data is replaced by the red pajama data set this release is a preview release of the 7 billion parameter model trained on only 200 billion tokens of data which is only one-fifth of the total for the 7 billion parameter model they have also published this preliminary evaluation which looks promising however the training is only about 20 percent done so apart from quality control and project documentation further conclusions made from these Benchmark results can only be temporary about open Llama training loss curve I noticed the flattening of the learning rate after training for 200 billion parameters this line seems almost horizontal while it is likely an artifact of the plotting scale or something else I'm comparing it to this loss curve in the Llama paper where at 200 billion Tokens The Lost curve still had a clear downward slope this slope LED them to speculate that further training was very likely possible well 
beyond the 1.4 trillion tokens and now let's look at the future of open Llama indeed the future looks bright as you can see here on the chat language model tracker the Llama models are highlighted in yellow and they are the most popular I must say I'm looking forward to open Llama becoming a free alternative model for all these chat language models that now use llama but also keep releasing their llm research code and weights to the world to help the world become a better place thank you
--------------------------------------------------------------------------------
/text_summarization.txt:
--------------------------------------------------------------------------------
1 | Automatic text summarization with machine learning is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content. It is a challenging task that requires extensive research in the NLP area. There are two different approaches for automatic text summaryization: extraction and abstraction. The extraction method involves identifying important sections of the text and stitching together portions of the content to produce a condensed version. The scoring function assigns a value to each sentence denoting the probability with which it will get picked up in the summary. The process involves constructing an intermediate representation of the input text and scoring the sentences based on the representation. A typical flow of extractive summarization systems involves constructing intermediate representations of the input text, scoring sentences based on the representation, and using Latent semantic analysis (LSA) to identify semantically important sentences. Recent studies have also applied deep learning in extractive text summaryization, such as Sukriti's approach for factual reports using a deep learning model, Yong Zhang's document summarizing framework using convolutional neural networks, and Y. Kim's regression process for sentence ranking. The neural architecture used in the paper is compounded by one single convolution layer built on top of pre-trained word vectors followed by a max-pooling layer. Experiments have shown the proposed model achieved competitive or even better performance compared with baselines. Abstractive summarization methods aim to produce summary by interpreting the text using advanced natural language techniques to generate a new shorter text that conveys the most critical information from the original text. They take advantage of recent developments in deep learning and use an attention-based encoder-decoder method for generating abstractive summaries. Recent studies have argued that attention to sequence models can suffer from repetition and semantic irrelevance, causing grammatical errors and insufficient reflection of the main idea of the source text. Junyang Lin et al proposes a gated unit on top of the encoder outputs at each time step to tackle this problem. The code to reproduce the experiments from the NAMAS paper can be found here. The Pointer Network is a neural attention-based sequence-to-sequence architecture that learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Other methods for abstractive summarization include Pointer-Generator, which allows copying words from the input sequence via pointing of specific positions, and a generator that generates words from a fixed vocabulary of 50k words. To overcome repetition problems, the paper adapts the coverage model of Tu et al. to overcome the lack of coverage of source words in neural machine translation models. To train the extractor on available document-summary pairs, the model uses a policy-based reinforcement learning (RL) with sentence-level metric rewards to connect both extractor and abstractor networks and to learn sentence saliency. The abstractor network is an emphasis-based encoder-decoder which compresses and paraphrases an extracted document sentence to a concise summary sentence. 
An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The extractor agent is a convolutional sentence encoder that computes representations for each sentence based on input embedded word vectors. An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The method incorporates abstractive approach advantages of concisely rewriting sentences and generating novel words from the full vocabulary, while adopting intermediate extractive behavior to improve the overall model's quality, speed, and stability. Recent studies have proposed a combination of adversarial processes and reinforcement learning to abstractive summarization. The extractive approach is easier because copying large chunks of text from the source document ensures good levels of grammaticality and accuracy, while the abstractive model generates new phrases, rephrasing or using words that were not in the original text. Recent developments in the deep learning area have allowed for more sophisticated abilities to be generated.
--------------------------------------------------------------------------------
/working/dataf.py:
--------------------------------------------------------------------------------
1 | #################### VIDEO AND SCRIPT SECTION #############################
2 | # Extract Video Informations for future use in PDF or word processor
3 | # Prepare the text for Summarization (no fuss, plain text)
4 | ###########################################################################
5 | import ssl
6 | from pytube import YouTube as YT
7 | import re
8 | import textwrap
9 |
10 | # SSL for proxied internet access
11 | ssl._create_default_https_context = ssl._create_unverified_context
12 |
13 | # paste here the url of the youtubevideo you want
14 | url = "https://youtu.be/SCYMLHB7cfY" # or https://youtu.be/5g1z4Sr-UHM
15 | # we instantiate a YouTube object with our url
16 | myvideo = YT(url, use_oauth=True, allow_oauth_cache=True)
17 | # required only the first time, to see which caption languages are available
18 | print(myvideo.title)
19 | print(myvideo.captions) #print the options of languages available
20 | #Commented out so the auto-generated captions are chosen automatically
21 | #code = input("input the code you want: ") #original
22 | print("Scraping subtitles...")
23 | #Commented to test the auto-generated ones
24 | #sub = myvideo.captions[code] #original
25 | sub = myvideo.captions['a.en']
26 | caption = sub.generate_srt_captions()
27 | #print(caption)
28 |
29 | # Club Video Title, details and Description, only for printed version
30 | # not for the Summarization one
31 | # possible in future to prepare for Markdown to PDF export
32 | import datetime
33 | m1 = f"TITLE: {myvideo.title}"+'\n'
34 | m2 = f"thumbnail url: {myvideo.thumbnail_url}"+'\n'
35 | m4 = f"video Duration: {str(datetime.timedelta(seconds=myvideo.length))}"+'\n'
36 | m5 = "----------------------------------------"+'\n'
37 | #m6 = textwrap.fill(myvideo.description, 80)+'\n' #solution not good
38 | m6 = myvideo.description+'\n'
39 | m7 = "----------------------------------------"+'\n'
40 | m_intro = m1+m2+m4+m5+m6+m7
41 |
42 | # Function to clean up the srt text
43 | def clean_sub(sub_list):
44 | lines = sub_list
45 | text = ''
46 | for line in lines:
47 | if re.search('^[0-9]+$', line) is None and re.search('^[0-9]{2}:[0-9]{2}:[0-9]{2}', line) is None and re.search('^$', line) is None:
48 | text += ' ' + line.rstrip('\n')
49 | text = text.lstrip()
50 | #print(text)
51 | return text
52 |
53 | print("Transform subtitles to TEXT...")
54 | srt_list = str(caption).split('\n') #generate a list with all lines
55 | final_text = clean_sub(srt_list)
56 |
57 | to_sum_text = m1+m4+m5+final_text
58 | wrapped_text = textwrap.fill(to_sum_text, 100)
59 | print(wrapped_text)
60 | with open('video_transcript.txt', 'w') as f:
61 | f.write(to_sum_text)
62 | f.close()
63 | print('File video_transcript.txt saved')
64 |
65 |
66 | """
67 | #print(final_text)
68 | #PREPARE A LONG STRING FOR THE SUMMARIZATION, no Video Description Here
69 | intro_summarization = 'Video Title: '+myvideo.title+' - (video url: )'+url+' --- '+'\n'
70 | #PREPARE A LONG TEXT with details and Video Description Here: to be used
71 | #as a note or Blog
72 | intro_blog = 'Video Title: '+myvideo.title+'\n'+'Video url: '+url+'\n'+'-------------'+'\n'
73 | summarization_text = intro_summarization + final_text
74 | import textwrap
75 | blog_text = m_intro + textwrap.fill(final_text, 70)
76 |
77 | def correct_filename_cr(title):
78 | string = title
79 | finalfilename = ''.join(e for e in string if e.isalnum())+'_cr.txt'
80 | return finalfilename
81 | def correct_filename_nocr(title):
82 | string = title
83 | finalfilename = ''.join(e for e in string if e.isalnum())+'_nocr.txt'
84 | return finalfilename
85 |
86 | print("Saving to TXT file...")
87 | filename_file = correct_filename_cr(myvideo.title) #with carriage return
88 | filename_summary = correct_filename_nocr(myvideo.title) #with no carriage return
89 |
90 | #prepare the text for the Blog or Notes
91 | #with carriage return
92 | import textwrap
93 | wrapped_text = textwrap.fill(final_text, 100)
94 |
95 | # write the cleaned up text into a file called final_text.txt
96 | with open(filename_file, 'w') as f:
97 | f.write(blog_text)
98 | f.close()
99 | with open(filename_summary, 'w') as f:
100 | f.write(summarization_text)
101 | f.close()
102 | """
103 |
104 |
105 |
106 |
107 |
108 |
109 |
110 |
111 |
112 |
113 |
114 | """
115 | import ssl
116 | import requests
117 |
118 | ########### SSL FOR PROXY ##############
119 | ssl._create_default_https_context = ssl._create_unverified_context
120 |
121 | response = requests.get('https://github.com/fabiomatricardi/pytubeFix/raw/main/1213-python10/captions.py')
122 |
123 | with open('captions.py', 'w') as f:
124 |     f.write(response.text)
125 | f.close()
126 |
127 | response = requests.get('https://github.com/fabiomatricardi/pytubeFix/raw/main/1213-python10/cipher.py')
128 |
129 | with open('cipher.py', 'w') as f:
130 |     f.write(response.text)
131 | f.close()
132 |
133 |
134 | """
--------------------------------------------------------------------------------
/pp.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | #### IMPORTS FOR AI PIPELINES ###############
4 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
5 | from transformers import pipeline
6 | from transformers import AutoModel, T5Tokenizer, T5Model
7 | from transformers import T5ForConditionalGeneration
8 | from langchain.llms import HuggingFacePipeline
9 | import torch
10 |
11 | # SET THE MODEL PATH
12 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
13 | # INITIALIZE TOKENIZER AND MODEL
14 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
15 | base_model = T5ForConditionalGeneration.from_pretrained(
16 | checkpoint,
17 | device_map='auto',
18 | torch_dtype=torch.float32)
19 | pipe_sum = pipeline('summarization',
20 | model = base_model,
21 | tokenizer = tokenizer,
22 | max_length = 350,
23 | min_length = 25)
24 |
25 | text = " Automatic text summarization with machine learning is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content. It is a challenging task that requires extensive research in the NLP area. There are two different approaches for automatic text summaryization: extraction and abstraction. The extraction method involves identifying important sections of the text and stitching together portions of the content to produce a condensed version. The scoring function assigns a value to each sentence denoting the probability with which it will get picked up in the summary. The process involves constructing an intermediate representation of the input text and scoring the sentences based on the representation. A typical flow of extractive summarization systems involves constructing intermediate representations of the input text, scoring sentences based on the representation, and using Latent semantic analysis (LSA) to identify semantically important sentences. Recent studies have also applied deep learning in extractive text summaryization, such as Sukriti's approach for factual reports using a deep learning model, Yong Zhang's document summarizing framework using convolutional neural networks, and Y. Kim's regression process for sentence ranking. The neural architecture used in the paper is compounded by one single convolution layer built on top of pre-trained word vectors followed by a max-pooling layer. Experiments have shown the proposed model achieved competitive or even better performance compared with baselines. Abstractive summarization methods aim to produce summary by interpreting the text using advanced natural language techniques to generate a new shorter text that conveys the most critical information from the original text. They take advantage of recent developments in deep learning and use an attention-based encoder-decoder method for generating abstractive summaries. Recent studies have argued that attention to sequence models can suffer from repetition and semantic irrelevance, causing grammatical errors and insufficient reflection of the main idea of the source text. Junyang Lin et al proposes a gated unit on top of the encoder outputs at each time step to tackle this problem. The code to reproduce the experiments from the NAMAS paper can be found here. The Pointer Network is a neural attention-based sequence-to-sequence architecture that learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Other methods for abstractive summarization include Pointer-Generator, which allows copying words from the input sequence via pointing of specific positions, and a generator that generates words from a fixed vocabulary of 50k words. To overcome repetition problems, the paper adapts the coverage model of Tu et al. to overcome the lack of coverage of source words in neural machine translation models. To train the extractor on available document-summary pairs, the model uses a policy-based reinforcement learning (RL) with sentence-level metric rewards to connect both extractor and abstractor networks and to learn sentence saliency. The abstractor network is an emphasis-based encoder-decoder which compresses and paraphrases an extracted document sentence to a concise summary sentence. 
An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The extractor agent is a convolutional sentence encoder that computes representations for each sentence based on input embedded word vectors. An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The method incorporates abstractive approach advantages of concisely rewriting sentences and generating novel words from the full vocabulary, while adopting intermediate extractive behavior to improve the overall model's quality, speed, and stability. Recent studies have proposed a combination of adversarial processes and reinforcement learning to abstractive summarization. The extractive approach is easier because copying large chunks of text from the source document ensures good levels of grammaticality and accuracy, while the abstractive model generates new phrases, rephrasing or using words that were not in the original text. Recent developments in the deep learning area have allowed for more sophisticated abilities to be generated."
26 | # RUN THE PIPELINE ON THE TEXT AND PRINT RESULT
27 | result = pipe_sum(text)
28 | print(result)
29 | # print(result[0]['summary_text'])
--------------------------------------------------------------------------------
/on MediumRepo/main.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | #### IMPORTS FOR AI PIPELINES ###############
4 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
5 | from transformers import pipeline
6 | from transformers import AutoModel, T5Tokenizer, T5Model
7 | from transformers import T5ForConditionalGeneration
8 | from langchain.llms import HuggingFacePipeline
9 | import torch
10 |
11 | # SET THE MODEL PATH
12 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
13 | # INITIALIZE TOKENIZER AND MODEL
14 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
15 | base_model = T5ForConditionalGeneration.from_pretrained(
16 | checkpoint,
17 | device_map='auto',
18 | torch_dtype=torch.float32)
19 | pipe_sum = pipeline('summarization',
20 | model = base_model,
21 | tokenizer = tokenizer,
22 | max_length = 350,
23 | min_length = 25)
24 |
25 | text = " Automatic text summarization with machine learning is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content. It is a challenging task that requires extensive research in the NLP area. There are two different approaches for automatic text summaryization: extraction and abstraction. The extraction method involves identifying important sections of the text and stitching together portions of the content to produce a condensed version. The scoring function assigns a value to each sentence denoting the probability with which it will get picked up in the summary. The process involves constructing an intermediate representation of the input text and scoring the sentences based on the representation. A typical flow of extractive summarization systems involves constructing intermediate representations of the input text, scoring sentences based on the representation, and using Latent semantic analysis (LSA) to identify semantically important sentences. Recent studies have also applied deep learning in extractive text summaryization, such as Sukriti's approach for factual reports using a deep learning model, Yong Zhang's document summarizing framework using convolutional neural networks, and Y. Kim's regression process for sentence ranking. The neural architecture used in the paper is compounded by one single convolution layer built on top of pre-trained word vectors followed by a max-pooling layer. Experiments have shown the proposed model achieved competitive or even better performance compared with baselines. Abstractive summarization methods aim to produce summary by interpreting the text using advanced natural language techniques to generate a new shorter text that conveys the most critical information from the original text. They take advantage of recent developments in deep learning and use an attention-based encoder-decoder method for generating abstractive summaries. Recent studies have argued that attention to sequence models can suffer from repetition and semantic irrelevance, causing grammatical errors and insufficient reflection of the main idea of the source text. Junyang Lin et al proposes a gated unit on top of the encoder outputs at each time step to tackle this problem. The code to reproduce the experiments from the NAMAS paper can be found here. The Pointer Network is a neural attention-based sequence-to-sequence architecture that learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Other methods for abstractive summarization include Pointer-Generator, which allows copying words from the input sequence via pointing of specific positions, and a generator that generates words from a fixed vocabulary of 50k words. To overcome repetition problems, the paper adapts the coverage model of Tu et al. to overcome the lack of coverage of source words in neural machine translation models. To train the extractor on available document-summary pairs, the model uses a policy-based reinforcement learning (RL) with sentence-level metric rewards to connect both extractor and abstractor networks and to learn sentence saliency. The abstractor network is an emphasis-based encoder-decoder which compresses and paraphrases an extracted document sentence to a concise summary sentence. 
An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The extractor agent is a convolutional sentence encoder that computes representations for each sentence based on input embedded word vectors. An RNN encoder computes context-aware representation and then an RNN decoder selects sentence at time step t. The method incorporates abstractive approach advantages of concisely rewriting sentences and generating novel words from the full vocabulary, while adopting intermediate extractive behavior to improve the overall model's quality, speed, and stability. Recent studies have proposed a combination of adversarial processes and reinforcement learning to abstractive summarization. The extractive approach is easier because copying large chunks of text from the source document ensures good levels of grammaticality and accuracy, while the abstractive model generates new phrases, rephrasing or using words that were not in the original text. Recent developments in the deep learning area have allowed for more sophisticated abilities to be generated."
26 | # RUN THE PIPELINE ON THE TEXT AND PRINT RESULT
27 | result = pipe_sum(text)
28 | print(result)
29 | # print(result[0]['summary_text'])
--------------------------------------------------------------------------------
/test-translation.py:
--------------------------------------------------------------------------------
1 | import ssl
2 | ########### SSL FOR PROXY ##############
3 | ssl._create_default_https_context = ssl._create_unverified_context
4 |
5 |
6 | #### IMPORTS FOR AI PIPELINES ###############
7 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
8 | from transformers import pipeline
9 |
10 | from transformers import AutoModel, T5Tokenizer, T5Model
11 | from transformers import T5ForConditionalGeneration
12 | from langchain.llms import HuggingFacePipeline
13 | import torch
14 | #from functools import reduce #for highlighter
15 | #from itertools import chain #for highlighter
16 | import datetime
17 | import os
18 | import requests
19 | from langchain.embeddings import HuggingFaceEmbeddings #for using HugginFace models
20 | from langchain import HuggingFaceHub
21 | # PUT HERE YOUR HUGGING FACE API TOKEN, it should start with hf_XXXXXXX...
22 | os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # placeholder, use your own token
23 | #############################################################################
24 | # SIMPLE TRANSLATION GENERATION INFERENCE
25 | # checkpoint = "./model_it/"  (Helsinki-NLP/opus-mt-en-it)
26 | # ###########################################################################
27 |
28 | from rich import console
29 | from rich.console import Console
30 | from rich.panel import Panel
31 | from rich.text import Text
32 | from functools import reduce
33 | from itertools import chain
34 | import datetime
35 |
36 | console = Console()
37 |
38 | #LOCAL MODEL EN-IT
39 | #---------------------------------
40 | # Helsinki-NLP/opus-mt-en-it
41 | Model_IT = './model_it/' #torch
42 | #---------------------------------
43 |
44 | pippo = """
45 | Introduction
46 | We rely on google translate services to help us with translating text from one language to other and I always wanted to develop an app like that and to know how translation works in the backend.
47 |
48 | Let’s list down the components which will be there in our translate app.
49 |
50 | These are those components -
51 |
52 | A Multi-language translation model.
53 | An API service which takes all the necessary parameters sends those parameters to the model and returns the translated text back as response.
54 | A front-end app which provides a GUI to the user to interact with.
55 | An ideal flow will look like user typing a text in the input text, selecting the desired target language and clicking the translate button. Once the button is clicked, we get the request data and send to the API service which eventually passes that to the model and get the results back as a response. The response is then shown into the UI.
56 |
57 | The Big Challenge
58 | The ML model is going to be the brain behind our translation app. To train a state-of-the-art model from scratch we would need the following things -
59 |
60 | Huge amounts of training data containing examples of text in one language and its translation in the other language.
61 | Create a neural network model which consists of more than a million parameters.
62 | A high end multi-GPU based environment to train that model.
63 | Time.
64 | But my goal here is to develop a MVP or a small POC which could help me understand and demonstrate the process of making a translation app and be able to complete this over a weekend.
65 |
66 | HuggingFace to the rescue
67 | The solution is that we can use a pre-trained model which is trained for translation tasks and can support multiple languages.
68 |
69 | HuggingFace consists of an variety of transformers/pre-trained models. One of the translation models is MBart which was presented by Facebook AI research team in 2020 — Multilingual Denoising Pre-training for Neural Machine Translation.
70 |
71 | Great!! Now that we have a pre-trained model in place. Time to put the pieces of the puzzle in the right places.
72 | """
73 | with console.status("Preparing model and pipeline...", spinner="monkey"):
74 | from langchain.text_splitter import CharacterTextSplitter
75 | # TEXT SPLITTER FUNCTION FOR CHUNKING
76 | text_splitter = CharacterTextSplitter(
77 | separator = "\n\n",
78 | chunk_size = 300,
79 | chunk_overlap = 0,
80 | length_function = len,
81 | )
82 | # CHUNK THE DOCUMENT
83 | console.print('[bold blue] Chunking the text...')
84 | texts = text_splitter.create_documents([pippo])
85 |     console.print('[bold red] Initialize AI tokenizer...')
86 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
87 | # INITIALIZE TRANSLATION FROM ENGLISH TO ITALIAN
88 |
89 | tokenizer_tt0it = AutoTokenizer.from_pretrained(Model_IT) #google/byt5-small #facebook/m2m100_418M
90 |     console.print('[bold green] Initialize AI model...')
91 | model_tt0it = AutoModelForSeq2SeqLM.from_pretrained(Model_IT) #Helsinki-NLP/opus-mt-en-it or #Helsinki-NLP/opus-mt-it-en
92 | TToIT = pipeline("translation", model=model_tt0it, tokenizer=tokenizer_tt0it)
93 | # Example TToIT("How old are you?")[0]['translation_text']
94 |
95 | # ITERATE OVER CHUNKS AND JOIN THE TRANSLATIONS
96 | finaltext = ''
97 | start = datetime.datetime.now() #not used now but useful
98 | console.print('[bold yellow] Translation in progress...')
99 | for item in texts:
100 | line = TToIT(item.page_content)[0]['translation_text']
101 | finaltext = finaltext+line+'\n'
102 | stop = datetime.datetime.now() #not used now but useful
103 | elapsed = stop - start
104 | console.print(f'[bold underline green1] Translation generated in [reverse dodger_blue2]{elapsed}[/reverse dodger_blue2]...')
105 | console.print(Panel(finaltext, title="AI Translation", title_align="center"))
106 |
107 |
108 |
--------------------------------------------------------------------------------
/750x250.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/my_data/extractive-vs-abstractive-summarization-in-healthcare.txt:
--------------------------------------------------------------------------------
1 | Title: Extractive vs Abstractive Summarization in Healthcare
2 | ------------------------------------------------------------
3 | source: https://www.abstractivehealth.com/extractive-vs-abstractive-summarization-in-healthcare
4 | author is Vince Hartman
5 |
6 | Extractive vs Abstractive Summarization in Healthcare
7 | There are two approaches to summarize information: extractive summarization which copies the most relevant sentences from a text, and abstractive summarization which generates new sentences. Abstractive summarization is the most promising method for automated text summarization and has recently been possible thanks to the advancement of the NLP transformer models.
8 |
9 | Summarizing text is surprisingly hard. While there are an infinite number of ways you can distill text into the most important parts, doing it well requires you to master conciseness, coherence and comprehension. We create summaries regularly for numerous activities such as academic papers, wikipedia entries, movies, books, legal documents, business ideas, and even ourselves (the “tell me about yourself” in interviews). It’s no wonder that an automated summary has been pursued for over 70 years in the fields of statistics and computer science, and 20 years within healthcare. Recently, results for automated computerized summaries have been impressive. And for the first time, a good abstractive summary is now possible in healthcare.
10 |
11 | So what is abstractive summarization? In the field of summarization, there are two approaches: extractive and abstractive methods. Extractive summarization copies (or extracts) the most important words or phrases from a text to concatenate the content: i.e. imagine selecting the top 3 sentences in a document and presenting those as the summary. Abstractive summarization generates new sentences that never existed by synthesizing the salience of the original text: i.e. paraphrasing the central idea in your own words.
12 |
13 | An obvious problem with extractive summarization is that it lacks fluency; the sentences don’t flow naturally from one sentence to the next. It is generally jarring since there are no transitions between topics and the next sentence. Secondly and most importantly, the main idea of the text might be buried in the original source text and thus cannot be captured in one individual sentence, so comprehension might be lacking. Extractive summarization generally works well for a structured source text, like a news article, where the author presents the most important content to the reader in a key thesis sentence (that topic statement we were trained to write for the five-paragraph essay). Where extractive methods fail is for more artistic and unstructured text when the main idea is a crescendo over numerous pages. Such as when we read a great novel and come to understand the main idea as we reach the denouement. For example, extraction would work poorly for a novel like Moby Dick which opens with the iconic line “Call me Ishmael”. While a beautiful and popular line, the sentence by itself provides little context that the novel is ultimately about the destructive nature of Ahab’s obsessive quest of a gigantic sperm whale.
14 |
15 | In healthcare, extractive summarization is great to get the high level diagnoses, allergies, and past procedures for a patient, and then structure all that content into a simple rule-based algorithm. Some general weakness of this approach is the summary reads like a computer wrote it and maintaining all those rules becomes quickly difficult. The most glaring weakness is that these summaries lack context for how the patient is progressing over their treatment. For example, about 10% of the US population has type 2 diabetes, it’s a very common disease; but the disease can be life threatening if not managed properly. A typical extractive summary of a patient would inform you that a patient has the ICD-10 code for diabetes, but it provides no context if the patient has been managing their blood sugar levels well or is at risk for hospitalization. The course of their treatment is not captured in the extractive summary and the physician is still left to search through the hundreds of notes to understand their patient. This is where abstractive summarization techniques excel.
16 |
17 | Abstractive summarization is relatively new in healthcare and has coincided with the advancement of the NLP transformer models that have taken off since 2017 (the release of BERT). Because healthcare is particularly challenging, the only commercial applications to date are for automating the radiology impression section for radiologists. The impression section summarizes the key findings of a radiology report, so a computerized version saves the radiologists’ time by not needing to manually write out that summary. The findings section of a radiology report is generally less than 500 words, so a computerized summary does not have to worry about the challenges of longform documentation (i.e. summarizing thousands of words from the whole medical record). That said, these commercial applications still need to address other challenges with a good factual summary in healthcare designed for an individual physician; so the technology is definitely impressive.
18 |
19 | With our current pilot with Abstractive Health with Weill Cornell Medical Center, we are building the first commercial abstractive summary of the full patient record in healthcare (so hundreds of notes and not just the radiology report). Our summarization structure is based on those same NLP transformer models from 2017 with some significant modifications. And one of our core research assessments that we are demonstrating is that our automated summary of the patient chart is a close equivalent to a physician written summary. Thus, our tool could be used as a supplement for physicians at patient admission, transfer, and discharge workflows.
20 |
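A rough, editor-added sketch of the abstractive approach the article describes (not part of the original data file): a transformer summarization pipeline paraphrasing a short clinical-style note. The checkpoint name and the toy note below are assumptions; any seq2seq summarization model would serve the same illustrative purpose.

# Editor-added sketch: abstractive summarization with a transformer pipeline.
# The model name is an assumption, not the system described in the article.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

note = ("Patient with type 2 diabetes seen for follow-up. Blood sugar levels have been "
        "well controlled on metformin over the last three months. No new complaints. "
        "Continue current regimen and recheck HbA1c in three months.")

# The pipeline generates new sentences rather than copying them from the note.
result = summarizer(note, max_length=40, min_length=10)
print(result[0]["summary_text"])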
--------------------------------------------------------------------------------
/LaMini-TextSummarizer.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | import ssl
4 | ############# Displaying images on the front end #################
5 | st.set_page_config(page_title="Summarize and Talk to your Text",
6 | page_icon='📖',
7 | layout="centered", #or wide
8 | initial_sidebar_state="expanded",
9 | menu_items={
10 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
11 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
12 | 'About': "# This is a header. This is an *extremely* cool app!"
13 | },
14 | )
15 | ########### SSL FOR PROXY ##############
16 | ssl._create_default_https_context = ssl._create_unverified_context
17 |
18 | #### IMPORTS FOR AI PIPELINES ###############
19 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
20 | from transformers import pipeline
21 |
22 | from transformers import AutoModel, T5Tokenizer, T5Model
23 | from transformers import T5ForConditionalGeneration
24 | from langchain.llms import HuggingFacePipeline
25 | import torch
26 | import datetime
27 |
28 | #############################################################################
29 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
30 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
31 | # ###########################################################################
32 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
33 | LaMini = './model/'
34 |
35 | ######################################################################
36 | # SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE #
37 | ######################################################################
38 | def AI_SummaryPL(checkpoint, text, chunks, overlap):
39 |
40 | """
41 |     checkpoint: relative path to the model folder
42 |     example: checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
43 |     text: a long string (pasted input or a document loaded into a string)
44 |     chunks: integer, length of each chunk for splitting
45 |     overlap: integer, chunk overlap to help attention and focus retrieval
46 | RETURNS full_summary (str), delta(str) and reduction(str)
47 |
48 | post_summary14 = AI_SummaryPL(LaMini,doc2,3700,500)
49 | USAGE EXAMPLE:
50 | post_summary, post_time, post_percentage = AI_SummaryPL(LaMini,originalText,3700,500)
51 | """
52 | from langchain.text_splitter import RecursiveCharacterTextSplitter
53 | text_splitter = RecursiveCharacterTextSplitter(
54 | # Set a really small chunk size, just to show.
55 | chunk_size = chunks,
56 | chunk_overlap = overlap,
57 | length_function = len,
58 | )
59 | texts = text_splitter.split_text(text)
60 | #checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
61 | checkpoint = checkpoint
62 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
63 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint,
64 | device_map='auto',
65 | torch_dtype=torch.float32)
66 | ### INITIALIZING PIPELINE
67 | pipe_sum = pipeline('summarization',
68 | model = base_model,
69 | tokenizer = tokenizer,
70 | max_length = 350,
71 | min_length = 25
72 | )
73 | ## START TIMER
74 | start = datetime.datetime.now() #not used now but useful
75 | ## START CHUNKING
76 | full_summary = ''
77 | for cnk in range(len(texts)):
78 | result = pipe_sum(texts[cnk])
79 | full_summary = full_summary + ' '+ result[0]['summary_text']
80 | stop = datetime.datetime.now() #not used now but useful
81 | ## TIMER STOPPED AND RETURN DURATION
82 | delta = stop-start
83 | ### Calculating Summarization PERCENTAGE
84 | reduction = '{:.1%}'.format(len(full_summary)/len(text))
85 | print(f"Completed in {delta}")
86 | print(f"Reduction percentage: ", reduction)
87 |
88 | return full_summary, delta, reduction
89 |
90 |
91 | global text_summary
92 |
93 | ### HEADER section
94 | st.image('Headline-text.jpg', width=750)
95 | title = st.text_area('Insert here your Copy/Paste text', "", height = 350, key = 'copypaste')
96 | btt = st.button("1. Start Summarization")
97 | txt = st.empty()
98 | timedelta = st.empty()
99 | text_length = st.empty()
100 | redux_bar = st.empty()
101 | st.divider()
102 | down_title = st.empty()
103 | down_btn = st.button('2. Download Summarization')
104 | text_summary = ''
105 |
106 | def start_sum(text):
107 | if st.session_state.copypaste == "":
108 | st.warning('You need to paste some text...', icon="⚠️")
109 | else:
110 | with st.spinner('Initializing pipelines...'):
111 | st.success(' AI process started', icon="🤖")
112 | print("Starting AI pipelines")
113 | text_summary, duration, reduction = AI_SummaryPL(LaMini,text,3700,500)
114 | txt.text_area('Summarized text', text_summary, height = 350, key='final')
115 | timedelta.write(f'Completed in {duration}')
116 |         text_length.markdown(f"Initial length = {len(text.split(' '))} words / summarization = **{len(text_summary.split(' '))} words**")
117 | redux_bar.progress(len(text_summary)/len(text), f'Reduction: **{reduction}**')
118 | down_title.markdown(f"## Download your text Summarization")
119 |
120 |
121 |
122 | if btt:
123 | start_sum(st.session_state.copypaste)
124 |
125 | if down_btn:
126 | def savefile(generated_summary, filename):
127 | st.write("Download in progress...")
128 | with open(filename, 'w') as t:
129 | t.write(generated_summary)
130 | t.close()
131 | st.success(f'AI Summarization saved in {filename}', icon="✅")
132 | savefile(st.session_state.final, 'text_summarization.txt')
133 | txt.text_area('Summarized text', st.session_state.final, height = 350)
134 |
135 |
136 |
137 |
--------------------------------------------------------------------------------
/on MediumRepo/LaMini-TextSummarizer.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | import ssl
4 | ############# Displaying images on the front end #################
5 | st.set_page_config(page_title="Summarize and Talk to your Text",
6 | page_icon='📖',
7 | layout="centered", #or wide
8 | initial_sidebar_state="expanded",
9 | menu_items={
10 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
11 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
12 | 'About': "# This is a header. This is an *extremely* cool app!"
13 | },
14 | )
15 | ########### SSL FOR PROXY ##############
16 | ssl._create_default_https_context = ssl._create_unverified_context
17 |
18 | #### IMPORTS FOR AI PIPELINES ###############
19 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
20 | from transformers import pipeline
21 |
22 | from transformers import AutoModel, T5Tokenizer, T5Model
23 | from transformers import T5ForConditionalGeneration
24 | from langchain.llms import HuggingFacePipeline
25 | import torch
26 | import datetime
27 |
28 | #############################################################################
29 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
30 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
31 | # ###########################################################################
32 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
33 | LaMini = './model/'
34 |
35 | ######################################################################
36 | # SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE #
37 | ######################################################################
38 | def AI_SummaryPL(checkpoint, text, chunks, overlap):
39 |
40 | """
41 |     checkpoint: relative path to the model folder
42 |     example: checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
43 |     text: a long string (pasted input or a document loaded into a string)
44 |     chunks: integer, length of each chunk for splitting
45 |     overlap: integer, chunk overlap to help attention and focus retrieval
46 | RETURNS full_summary (str), delta(str) and reduction(str)
47 |
48 | post_summary14 = AI_SummaryPL(LaMini,doc2,3700,500)
49 | USAGE EXAMPLE:
50 | post_summary, post_time, post_percentage = AI_SummaryPL(LaMini,originalText,3700,500)
51 | """
52 | from langchain.text_splitter import RecursiveCharacterTextSplitter
53 | text_splitter = RecursiveCharacterTextSplitter(
54 | # Set a really small chunk size, just to show.
55 | chunk_size = chunks,
56 | chunk_overlap = overlap,
57 | length_function = len,
58 | )
59 | texts = text_splitter.split_text(text)
60 | #checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
61 | checkpoint = checkpoint
62 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
63 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint,
64 | device_map='auto',
65 | torch_dtype=torch.float32)
66 | ### INITIALIZING PIPELINE
67 | pipe_sum = pipeline('summarization',
68 | model = base_model,
69 | tokenizer = tokenizer,
70 | max_length = 350,
71 | min_length = 25
72 | )
73 | ## START TIMER
74 | start = datetime.datetime.now() #not used now but useful
75 | ## START CHUNKING
76 | full_summary = ''
77 | for cnk in range(len(texts)):
78 | result = pipe_sum(texts[cnk])
79 | full_summary = full_summary + ' '+ result[0]['summary_text']
80 | stop = datetime.datetime.now() #not used now but useful
81 | ## TIMER STOPPED AND RETURN DURATION
82 | delta = stop-start
83 | ### Calculating Summarization PERCENTAGE
84 | reduction = '{:.1%}'.format(len(full_summary)/len(text))
85 | print(f"Completed in {delta}")
86 | print(f"Reduction percentage: ", reduction)
87 |
88 | return full_summary, delta, reduction
89 |
90 |
91 | global text_summary
92 |
93 | ### HEADER section
94 | st.image('Headline-text.jpg', width=750)
95 | title = st.text_area('Insert here your Copy/Paste text', "", height = 350, key = 'copypaste')
96 | btt = st.button("1. Start Summarization")
97 | txt = st.empty()
98 | timedelta = st.empty()
99 | text_length = st.empty()
100 | redux_bar = st.empty()
101 | st.divider()
102 | down_title = st.empty()
103 | down_btn = st.button('2. Download Summarization')
104 | text_summary = ''
105 |
106 | def start_sum(text):
107 | if st.session_state.copypaste == "":
108 | st.warning('You need to paste some text...', icon="⚠️")
109 | else:
110 | with st.spinner('Initializing pipelines...'):
111 | st.success(' AI process started', icon="🤖")
112 | print("Starting AI pipelines")
113 | text_summary, duration, reduction = AI_SummaryPL(LaMini,text,3700,500)
114 | txt.text_area('Summarized text', text_summary, height = 350, key='final')
115 | timedelta.write(f'Completed in {duration}')
116 |         text_length.markdown(f"Initial length = {len(text.split(' '))} words / summarization = **{len(text_summary.split(' '))} words**")
117 | redux_bar.progress(len(text_summary)/len(text), f'Reduction: **{reduction}**')
118 | down_title.markdown(f"## Download your text Summarization")
119 |
120 |
121 |
122 | if btt:
123 | start_sum(st.session_state.copypaste)
124 |
125 | if down_btn:
126 | def savefile(generated_summary, filename):
127 | st.write("Download in progress...")
128 | with open(filename, 'w') as t:
129 | t.write(generated_summary)
130 | t.close()
131 | st.success(f'AI Summarization saved in {filename}', icon="✅")
132 | savefile(st.session_state.final, 'text_summarization.txt')
133 | txt.text_area('Summarized text', st.session_state.final, height = 350)
134 |
135 |
136 |
137 |
--------------------------------------------------------------------------------
/test-translation_en_to_kr.py:
--------------------------------------------------------------------------------
1 | import ssl
2 | ########### SSL FOR PROXY ##############
3 | ssl._create_default_https_context = ssl._create_unverified_context
4 |
5 |
6 | #### IMPORTS FOR AI PIPELINES ###############
7 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
8 | from transformers import pipeline
9 |
10 | from transformers import AutoModel, T5Tokenizer, T5Model
11 | from transformers import T5ForConditionalGeneration
12 | from langchain.llms import HuggingFacePipeline
13 | import torch
14 | #from functools import reduce #for highlighter
15 | #from itertools import chain #for highlighter
16 | import datetime
17 | import os
18 | import requests
19 | from langchain.embeddings import HuggingFaceEmbeddings #for using HugginFace models
20 | from langchain import HuggingFaceHub
21 | # PUT HERE YOUR HUGGING FACE API TOKEN, it should start with hf_XXXXXXX...
22 | os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # placeholder, use your own token
23 | #############################################################################
24 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
25 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
26 | # ###########################################################################
27 | # Model_KR = MODEL FOR TRANSLATION ONLY FROM EN TO KR
28 | #from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
29 | #tokenizer = AutoTokenizer.from_pretrained("hcho22/opus-mt-ko-en-finetuned-en-to-kr")
30 | #model = AutoModelForSeq2SeqLM.from_pretrained("hcho22/opus-mt-ko-en-finetuned-en-to-kr")
31 | # THIS MODEL is H5 requires TENSORFLOW installed
32 | #model_ttKR = AutoModelForSeq2SeqLM.from_pretrained(Model_KR, from_tf=True) #for tensorflow
33 | ###########################################################################################
34 | LaMini = "./model/" #it is actually LaMini-Flan-T5-248M
35 | #Model_KR = './model_kr/' #tensorflow
36 | Model_KR = './opus-en-ko/' #torch
37 |
38 | from rich import console
39 | from rich.console import Console
40 | from rich.panel import Panel
41 | from rich.text import Text
42 | from functools import reduce
43 | from itertools import chain
44 | import datetime
45 |
46 | console = Console()
47 |
48 |
49 | pippo = """
50 | Introduction
51 | We rely on google translate services to help us with translating text from one language to other and I always wanted to develop an app like that and to know how translation works in the backend.
52 |
53 | Let’s list down the components which will be there in our translate app.
54 |
55 | These are those components -
56 |
57 | A Multi-language translation model.
58 | An API service which takes all the necessary parameters sends those parameters to the model and returns the translated text back as response.
59 | A front-end app which provides a GUI to the user to interact with.
60 | An ideal flow will look like user typing a text in the input text, selecting the desired target language and clicking the translate button. Once the button is clicked, we get the request data and send to the API service which eventually passes that to the model and get the results back as a response. The response is then shown into the UI.
61 |
62 | The Big Challenge
63 | The ML model is going to be the brain behind our translation app. To train a state-of-the-art model from scratch we would need the following things -
64 |
65 | Huge amounts of training data containing examples of text in one language and its translation in the other language.
66 | Create a neural network model which consists of more than a million parameters.
67 | A high end multi-GPU based environment to train that model.
68 | Time.
69 | But my goal here is to develop a MVP or a small POC which could help me understand and demonstrate the process of making a translation app and be able to complete this over a weekend.
70 |
71 | HuggingFace to the rescue
72 | The solution is that we can use a pre-trained model which is trained for translation tasks and can support multiple languages.
73 |
74 | HuggingFace consists of an variety of transformers/pre-trained models. One of the translation models is MBart which was presented by Facebook AI research team in 2020 — Multilingual Denoising Pre-training for Neural Machine Translation.
75 |
76 | Great!! Now that we have a pre-trained model in place. Time to put the pieces of the puzzle in the right places.
77 | """
78 | with console.status("Preparing model and pipeline...", spinner="material"):
79 | from langchain.text_splitter import CharacterTextSplitter
80 | # TEXT SPLITTER FUNCTION FOR CHUNKING
81 | text_splitter = CharacterTextSplitter(
82 | separator = "\n\n",
83 | chunk_size = 300,
84 | chunk_overlap = 0,
85 | length_function = len,
86 | )
87 | # CHUNK THE DOCUMENT
88 | console.print('[bold blue] Chunking the text...')
89 | texts = text_splitter.create_documents([pippo])
90 |     console.print('[bold red] Initialize AI tokenizer...')
91 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
92 |     # INITIALIZE TRANSLATION FROM ENGLISH TO KOREAN
93 |
94 | tokenizer_KR = AutoTokenizer.from_pretrained(Model_KR) #google/byt5-small #facebook/m2m100_418M
95 |     console.print('[bold green] Initialize AI model...')
96 | #model_ttKR = AutoModelForSeq2SeqLM.from_pretrained(Model_KR, from_tf=True) #Helsinki-NLP/opus-mt-en-it or #Helsinki-NLP/opus-mt-it-en
97 | model_ttKR = AutoModelForSeq2SeqLM.from_pretrained(Model_KR)
98 |     TToKR = pipeline("translation", model=model_ttKR, tokenizer=tokenizer_KR)
99 |     # Example: TToKR("How old are you?")[0]['translation_text']
100 |
101 | # ITERATE OVER CHUNKS AND JOIN THE TRANSLATIONS
102 | finaltext = ''
103 | start = datetime.datetime.now() #not used now but useful
104 | console.print('[bold yellow] Translation in progress...')
105 | with console.status("Translation to Korean in progress...", spinner="pong"):
106 | for item in texts:
107 | line = TToKR(item.page_content)[0]['translation_text']
108 | finaltext = finaltext+line+'\n'
109 | stop = datetime.datetime.now() #not used now but useful
110 | elapsed = stop - start
111 | console.print(f'[bold underline green1] Translation generated in [reverse dodger_blue2]{elapsed}[/reverse dodger_blue2]...')
112 | console.print(Panel(finaltext, title="AI Translation", title_align="center"))
113 |
114 |
115 |
--------------------------------------------------------------------------------
/1172x368.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/working/YoutubeSummarizer.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | import pandas as pd
4 | from io import StringIO
5 | from PIL import Image
6 | import ssl
7 | from pytube import YouTube as YT
8 | import re
9 | import textwrap
10 |
11 |
12 | ############# Displaying images on the front end #################
13 | st.set_page_config(page_title="Summarize and Talk to your Documents",
14 | page_icon='📖',
15 | layout="centered", #or wide
16 | initial_sidebar_state="expanded",
17 | menu_items={
18 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
19 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
20 | 'About': "# This is a header. This is an *extremely* cool app!"
21 | },
22 | )
23 | ########### SSL FOR PROXY ##############
24 | ssl._create_default_https_context = ssl._create_unverified_context
25 |
26 |
27 | #### IMPORTS FOR AI PIPELINES ###############
28 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
29 | from transformers import pipeline
30 |
31 | from transformers import AutoModel, T5Tokenizer, T5Model
32 | from transformers import T5ForConditionalGeneration
33 | from langchain.llms import HuggingFacePipeline
34 | import torch
35 | #from functools import reduce #for highlighter
36 | #from itertools import chain #for highlighter
37 | import datetime
38 |
39 | #############################################################################
40 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
41 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
42 | # ###########################################################################
43 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
44 | LaMini = './model/'
45 |
46 |
47 | ######################################################################
48 | # SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE #
49 | ######################################################################
50 | def AI_SummaryPL(checkpoint, text, chunks, overlap):
51 |
52 | """
53 |     checkpoint: relative path to the model folder
54 |     example: checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
55 |     text: a long string (pasted input or a document loaded into a string)
56 |     chunks: integer, length of each chunk for splitting
57 |     overlap: integer, chunk overlap to help attention and focus retrieval
58 | RETURNS full_summary (str), delta(str) and reduction(str)
59 |
60 | post_summary14 = AI_SummaryPL(LaMini,doc2,3700,500)
61 | USAGE EXAMPLE:
62 | post_summary, post_time, post_percentage = AI_SummaryPL(LaMini,originalText,3700,500)
63 | """
64 | from langchain.text_splitter import RecursiveCharacterTextSplitter
65 | text_splitter = RecursiveCharacterTextSplitter(
66 | # Set a really small chunk size, just to show.
67 | chunk_size = chunks,
68 | chunk_overlap = overlap,
69 | length_function = len,
70 | )
71 | texts = text_splitter.split_text(text)
72 | #checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
73 | checkpoint = checkpoint
74 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
75 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint,
76 | device_map='auto',
77 | torch_dtype=torch.float32)
78 | ### INITIALIZING PIPELINE
79 | pipe_sum = pipeline('summarization',
80 | model = base_model,
81 | tokenizer = tokenizer,
82 | max_length = 350,
83 | min_length = 25
84 | )
85 | ## START TIMER
86 | start = datetime.datetime.now() #not used now but useful
87 | ## START CHUNKING
88 | full_summary = ''
89 | for cnk in range(len(texts)):
90 | result = pipe_sum(texts[cnk])
91 | full_summary = full_summary + ' '+ result[0]['summary_text']
92 | stop = datetime.datetime.now() #not used now but useful
93 | ## TIMER STOPPED AND RETURN DURATION
94 | delta = stop-start
95 | ### Calculating Summarization PERCENTAGE
96 | reduction = '{:.1%}'.format(len(full_summary)/len(text))
97 | print(f"Completed in {delta}")
98 | print(f"Reduction percentage: ", reduction)
99 |
100 | return full_summary, delta, reduction
101 |
102 | with open('video_transcript.txt') as f:
103 | testo = f.read()
104 | f.close()
105 |
106 | global video_summary
107 | global processed
108 | processed = False
109 | #with st.columns(3)[1]:
110 | # st.header("hello world")
111 | # st.image("http://placekitten.com/200/200")
112 | #st.title('🏞️ Image display methods')
113 | #
114 | #st.write("a logo and text next to eachother")
115 | #
116 |
117 | ### HEADER section
118 | col1, col2, col3 = st.columns([1,20, 1])
119 | col2.image('youtubelogo.jpg', width=180)
120 | col2.title('AI Summarizer')
121 |
122 | #image_path = 'logo.png'
123 | #image = Image.open(image_path)
124 | #st.image('https://streamlit.io/images/brand/streamlit-mark-light.png')
125 | #st.image(image_path, width = 700)
126 |
127 |
128 | title = st.text_input('1. Input your Youtube Video url', 'Something like https://youtu.be/SCYMLHB7cfY....') #https://youtu.be/SCYMLHB7cfY
129 | videotitle = st.empty()
130 |
131 | btt = st.empty()
132 | txt = st.empty()
133 | video_dur = st.empty()
134 | video_redux = st.empty()
135 | st.divider()
136 |
137 | video_summary = ''
138 | def start_sum(text):
139 | if '...' in text:
140 | st.warning('Wrong youtube video link! Type a valid url like https://youtu.be/SCYMLHB7cfY', icon="⚠️")
141 | else:
142 |         videotitle.markdown(f"Video title: YouTube Video Title")
143 | print("Starting AI pipelines")
144 | video_summary, duration, reduction = AI_SummaryPL(LaMini,testo,3700,500)
145 | txt.text_area('Summarized text', video_summary, height = 450, key = 'result')
146 |         video_dur.markdown(f'Processing time :clock3: {duration}')
147 | video_redux.markdown(f"Percentage of reduction: {reduction} {len(video_summary.split(' '))}/{len(testo.split(' '))} words")
148 |         processed = True #status for the download button
149 | print(processed)
150 |
151 | if btt.button('2. Start Summarization', key='start'):
152 | with st.spinner('Initializing pipelines...'):
153 | st.success(' AI process started', icon="🤖")
154 | start_sum(title)
155 | else:
156 | st.write('Insert the video url in the input box above...')
157 |
158 | if st.button('3. Download Summarization'):
159 | st.markdown(f"## Download your YouTube Video Summarization")
160 | def savefile(generated_summary, filename):
161 | st.write("Download in progress...")
162 | with open(filename, 'w') as t:
163 | t.write(generated_summary)
164 | t.close()
165 | st.success(f'AI Summarization saved in {filename}', icon="✅")
166 | savefile(st.session_state.result, 'video_summarization.txt')
167 |
168 |
169 |
170 |
--------------------------------------------------------------------------------
/my_data/Derivative Tuning for PID Control.txt:
--------------------------------------------------------------------------------
1 | Title: Derivative Tuning for PID Control - ControlSoft
2 | ------------------------------------------------------
3 | source: https://www.controlsoftinc.com/derivative-tuning-for-pid-control/
4 |
5 | Derivative Tuning for PID Control
6 |
7 | Of the three letters in PID the D, Derivative Tuning, is probably the most misunderstood and certainly the least used! It is well-known that Derivative Tuning is largely unnecessary for fast loops (such as flows and pressures) due to those loops’ naturally quick response time. However, derivative tuning is extremely useful for particularly slow loops (such as temperature) and is an absolute must-have for integrating processes (such as level and insulated temperature loops).
8 |
9 | Why is Derivative Tuning Important?
10 |
11 | The answer is in the predictive action D provides. When you drive a car and approach a stop sign, you need to apply the brake. But how much brake should you apply? Among other things, the answer depends on how fast you are going and how close you are to the stop sign. The closer you are and the faster you are going, the more brake you need in order to reach the stop sign safely. Applying too much brake or applying it too early means you will not reach the stop sign. Applying too little brake or applying it late means you will travel past the stop sign.
12 |
13 | It’s the same in process control. Derivative action is a predictive brake for the controller. We want to use the right amount of D (the brake) so that we get to setpoint quickly (the stop sign) without overshooting (traveling past the goal) or undershooting (stopping short of the goal). And as anyone performing loop tuning can testify, getting the right amount of D is often a difficult task.
14 |
15 | To help understand derivative controller action, let’s examine two common misuses of the D term. To aid our study, we’ll examine the effects of the D term on a 3rd order integrating process with a gain rate of 0.34, a deadtime of 40 seconds, a lagtime of approximately 90 seconds, and some higher order dynamics as well. In the following examples, the PID controller uses a non-interacting equation when performing its calculations.
16 |
17 | NOTE: Given the integrating nature of this process, PID control in these examples was used with an extremely weak integral term. This essentially created a PD-only tuning situation since the integrating (I) action is naturally built into the process dynamics.
18 |
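As a rough numerical illustration of the ideas above (an editor-added sketch, not part of the original article), the non-interacting (ISA) form computes the controller output from the error, its integral, and its derivative, with the derivative term acting as the predictive brake described earlier. The gains below echo values quoted in this article (P = 0.6, I = 3000, D = 50), while the toy process model is a placeholder assumption.

# Editor-added sketch of one step of a discrete non-interacting (ISA) PID controller:
# CO = Kp * ( e + (1/Ti) * integral(e) + Td * de/dt )
def pid_step(error, prev_error, integral, kp, ti, td, dt):
    integral += error * dt                  # integral (I) term accumulates the error
    derivative = (error - prev_error) / dt  # derivative (D) term: the "predictive brake"
    co = kp * (error + integral / ti + td * derivative)
    return co, integral

# Toy usage with a placeholder first-order process response.
sp, pv, integral, prev_error, dt = 1.0, 0.0, 0.0, 0.0, 1.0
for _ in range(20):
    error = sp - pv
    co, integral = pid_step(error, prev_error, integral, kp=0.6, ti=3000.0, td=50.0, dt=dt)
    prev_error = error
    pv += 0.01 * (co - pv)  # placeholder process, not the 3rd order loop from the figures
print(round(pv, 3))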
19 | Misuse Example #1: Derivative Tuning Action Too Strong
20 |
21 | Figure 1 shows the reaction of our process to a PID controller with a Derivative tuning setting that is too strong.
22 |
23 | Figure 1. P = 0.6, D = 100. Non-interacting (ISA) equation.
24 |
25 | As the trend indicates, so much braking force is applied to the process that our derivative action actually hinders the PV from reaching setpoint. Not only that, but because the derivative action is so aggressive, it is preemptively acting on what it predicts will soon be a negative SP-PV error (though the actual error remains positive). This preemptive action on a predicted negative error is what causes the saw-tooth pattern in the PV and the CO.
26 |
27 | Misuse Example #2: Derivative Tuning Action Too Weak
28 |
29 | Figure 2 shows the reaction of our process to a PID controller with a Derivative tuning setting that is too weak.
30 |
31 | Figure 2. P = 0.6, D = 5. Non-interacting (ISA) equation.
32 |
33 | As the trend indicates, so little brake is applied to the process that our PV far overshoots its goal. For this process, product quality is significantly affected due to poor derivative tuning, and this particular company spends far more on natural gas to heat this loop in startup than is necessary. Multiplied out over similar loops in the rest of the plant, this is a significant energy and product resource waste—all from a poor use of derivative action in PID control.
34 |
35 | Correct Use Example #3: Derivative Tuning Action Just Right
36 |
37 | Figure 3 shows the reaction of our process to a PID controller with a Derivative tuning setting that is just right.
38 |
39 | Figure 3. P = 0.6, D = 50. Non-interacting (ISA) equation.
40 |
41 | Improving Process Control Performance with PID Loop Tuning Software
42 |
43 | In an ideal world, we could easily estimate the right amount of derivative action for a process. Of course, we don’t live in an ideal world and trying to understand the proper amount of D to use can be a long and complicated trial-and-error process, especially if you are tuning slower temperature and level loops.
44 |
45 | What if you could have a PID loop professional monitor your process 24 hours a day, 7 days a week and continuously make recommendations about how to adjust your P, I, and D for increasingly better control?
46 |
47 | ControlSoft’s Advisory AdaptTune technology, available in INTUNE PID Loop Tuning Tools, provides that exact service.
48 |
49 | In a recent test, the INTUNE AdaptTune function was allowed to monitor a difficult-to-tune, slow-responding temperature loop that was taking days to tune by hand. Even though it was understood that strong PD control was the proper approach to tuning this loop, the operator could not find the right combination of P and D.
50 |
51 | Over the course of just several hours (not days), AdaptTune made several minor recommendations on the P term and several suggested adjustments for the D term to bring the loop under increasingly better control. While in Auto mode, the setpoint was stepped up and down at half-hour intervals to test the response of the new tuning recommendations. The steps were within a healthy operating range for the process.
52 |
53 | Figure 4. AdaptTune first and second recommendations, indicated by red arrows.
54 |
55 | Figure 5. AdaptTune third recommendation, indicated by red arrow.
56 |
57 | Figure 6. AdaptTune final recommendation, time indicated by red arrow.
58 |
59 | Table 1 shows the recommendations AdaptTune used to bring the loop under control.
60 |
61 | Table 1. INTUNE AdaptTune Recommendations.
62 |
63 | Term  Initial   1st Iteration   2nd Iteration   3rd Iteration   4th Iteration
64 | P     0.76      0.68            0.45            0.36            0.32
65 | I     3000      3000            3000            3000            3000
66 | D     10        12              18              27              33
67 |
68 | By gradually reducing the Proportional action and strongly increasing the Derivative action, AdaptTune was able to tame one of the operator’s toughest, slowest, and most problematic control loops.
69 |
70 | So, Don’t Deny the D. If properly used, derivative control can enhance your controller response for integrating processes and particularly slow loops. Use it with confidence.
71 |
72 | Final Word of Caution: Derivative Action Applied to a Noisy Process
73 |
74 | As useful as the D term is, and as liberally as it can be applied in the case of slow, integrating processes, Figures 7a and 7b show why it is better to avoid a lot of D if you are dealing with an inherently noisy process.
75 |
76 | Figure 7a. Response of PV to a noisy process with D term included in the tuning.
77 |
78 | Figure 7b. Response of PV to a noisy process without D term included in the tuning.
79 |
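As a rough, self-contained illustration (ours, not taken from the trend data above), differencing a noisy measurement amplifies the noise, and the D term reacts to exactly that difference; a strong Td then multiplies the amplified noise straight into the controller output.

    # Toy illustration: the sample-to-sample derivative of a noisy PV swings far more than the PV itself.
    import random

    dt = 1.0  # sample time, seconds
    pv = [50.0 + random.uniform(-0.5, 0.5) for _ in range(100)]   # flat PV with +/-0.5 units of noise
    d_pv = [(pv[i] - pv[i - 1]) / dt for i in range(1, len(pv))]  # what the D term differentiates

    print(f"PV spread:         {max(pv) - min(pv):.2f}")          # close to 1 unit
    print(f"Derivative spread: {max(d_pv) - min(d_pv):.2f}")      # typically near 2 units per second
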
80 | To help alleviate this problem, have a maintenance team check the sensors, valves, and other hardware to make sure you are receiving the clearest, most accurate signal possible.
81 |
82 | You also want to make sure that what appears to be noise is not really a disturbance issue coming from some other aspect of the process upstream.
83 |
84 | [This article was first published in 2007 and has been revised for comprehensiveness.]
85 |
--------------------------------------------------------------------------------
/video_transcript.txt:
--------------------------------------------------------------------------------
1 | TITLE: GENIUS Reason Why Toyota Refuses To Switch To EVs!
2 | video Duration: 0:08:38
3 | ----------------------------------------
4 | demand for electric vehicles is clearly growing more Australians are buckling up in the race to go electric it's Toyota behind everyone else or is it smart to be cautious now thanks to advances in technology EVS are becoming popular in more affordable worldwide so you would expect Toyota to be at the Forefront leading the charge for an all-electric future for the Auto industry right not at all in fact the leadership of the Legacy car maker has stated numerous times that the company won't make the switch to EVS join us in this video where we uncover the real reason why Toyota has refused to make the all-important switch to EVs and why the reason is just brilliant but in order to understand Toyota's stance let's briefly catch you up to speed with why this is a big deal the EV Revolution was spearheaded by Tesla in the early 2010s and today we have lots of automakers dedicated to providing citizens of the world with a cleaner and better way of moving from one place to another apart from Tesla EV company such as rivien Lucid Motors byd and IO and many others are also taking up the charge Legacy car makers such as Ford Hyundai and Volkswagen have also invested billions of dollars into the process of building EVS for their customer base now looking at Toyota's Global sales numbers its top five markets are the US China Japan Canada and Australia all countries that have a high EV adoption rate and government transition incentives for making the switch the Japanese giant known for its hybrid cars has been slow to hop on the EV Trend and answer other companies feel is the way forward to tackle the climate problem instead they want to focus on other propulsion methods such as hydrogen hybrid and internal combustion engines or ice its former CEO who also happens to be the grandson of the company's founder Akio Toyota he had also voiced his opinions on the idea that EVS are the solution to the climate issues arising from using gas-powered vehicles and internal combustion engines Mr Toyota reiterated that his opinion is simply not true by pointing out the inconsistency in the argument led by Tesla and other EV experts after Akio Toyota step down from his position as CEO earlier in 2023 and former head of Lexis Koji Sato replaced him many people expected Toyota's position on the transition to EVS to change but Toyota shocked everyone again because Koji Sato reinstated the plan earlier stated by his predecessor that they have no plans of fully switching to EVS there's a reason why Toyota is the largest automaker because they planned for the next 20 50 and 100 years not the next five or ten the question that then follows is why what is Toyota's play here why has the auto giant refused to take the obvious route and join the EV Trend especially as it is seemingly the future of automobiles well the answer to these questions is multifaceted to start with the brilliant reason why Toyota has decided not to switch to EVS is born out of its understanding of the auto market Toyota knows that as long as there is freedom of choice there will be those who would not touch an EV with a 10-foot pole and would always go for an internal combustion engine-based gasoline vehicle this is why as Legacy automakers hopped on the EV Trend to take a loss on every EV sale they make Toyota will make a profit it is simply the truth about life we all cannot want the same thing and when it comes to Vehicles this concept remains valid so with this knowledge in mind it is clear that Toyota opt to position itself as the go-to option 
for people who still want to feel the powerful Roar of a gas-powered engine and take a drive without having to worry about charging in range basically Toyota's strategy is to please the widest possible range of customers with the widest possible range of power trades be it hybrids like the Prius hydrogen fuel cell vehicles or traditional internal combustion engine based gas-powered cars another reason why Toyota has refused to make the switch could be linked to the understanding of the reality of EV adoption in other words EV adoption is not as rosy as we have been made to believe there are many regions and parts of the world that have inadequate or zero infrastructure for robust EV adoption many countries in Africa and some parts of Asia have very little or no charging stations repair Maintenance and Service Centers for EVS as well as the relevant financial and credit facilities to facilitate the purchase of EVS without these important structures it is clear that these regions of the world would still rely on gas-powered vehicles for the nearest Future and Toyota seeks to remain their go-to option you may also want to consider this reason Toyota does not believe in electrification when it comes to cleaner energy sources instead the company has developed another type of fuel that could power vehicles and result in cleaner and more efficient cars this fuel type is hydrogen gas Toyota has been at the Forefront of championing the adoption of hydrogen fuel for cars the company went as far as building the world's first vehicle that runs on hydrogen gas the Toyota Mirai the Mirai is the best-selling hydrogen fuel cell car of all time since its launch in 2015 it has sold over 21 000 units that's how well the automaker believes in its idea of hydrogen fuel cells now it would be strange for a company that has set its sights on the development of hydrogen fuel to be overly involved in pushing for the adoption of yet another alternative fuel it would be like expecting Tesla to start making gas-powered Vehicles almost impossible right apart from leading the charge on the use of hydrogen fuel for cars Toyota is also focused on hybrid cars the company has been a Pioneer in hybrid technology with successful models like the Toyota Prius they have invested heavily in developing hybrid powertrains and have achieved significant success with them Toyota may see hybrid technology is a more practical and proven option with existing manufacturing capabilities and infrastructure and may choose to focus on improving and expanding its hybrid lineup rather than fully transitioning to EVS we can also make a case for technical challenges as one of the reasons why Toyota has refused to make the complete switch to EVS let's not forget that EVS come with their own set of technical challenges such as Battery Technology charging times and range limitations in fact we saw Toyota struggle with its first and only all-electric vehicle the bz-4x the bz series was Toyota's first entry into the pure EV market and while the crossover SUV had a Sleek design and a decent performance the company had to recall and stop the sale of this EV SUV in June 2022 due to issues with Hub bolts on the wheels coming loose in the event of hard braking or a sharp turn while this issue has been fixed and the sale of the EV has continued the company never really shook off this piece of bad PR that experience is not the kind of thing someone would expect from the largest car maker in the world with a reputation for building reliable low maintenance 
and high quality vehicles for over 80 years so it is possible that Toyota may be working to overcome these challenges or waiting for further advancements in technology for building EVS before committing to large-scale EV production they may also be concerned about potential issues related to the supply chain and availability of critical components like rare earth metals used in EV batteries it's important to note that many of the reasons mentioned in this video are speculative and may not reflect the current or complete reasoning behind Toyota's decision-making process Toyota is a large arguably the largest automaker and we know that automaker strategies and decisions are influenced by a multitude of factors including market research consumer preferences technological advancements regulatory policies and economic consideration among others
--------------------------------------------------------------------------------
/working/Mon_May_29_11-46-01_2023.svg:
--------------------------------------------------------------------------------
1 |
98 |
--------------------------------------------------------------------------------
/my_data/nlp-basics-abstractive-and-extractive-text-summarization.txt:
--------------------------------------------------------------------------------
1 | Title: NLP Basics: Abstractive and Extractive Text Summarization
2 | ----------------------------------------------------------------
3 | source: https://www.scrapehero.com/nlp-basics-abstractive-and-extractive-text-summarization/
4 |
5 | Summarization is one of the most common tasks that we perform in Natural Language Processing (NLP). With the amount of new content generated by billions of people and their smartphones every day, we are inundated with an ever-increasing amount of data. Humans can only consume a finite amount of information and need a way to separate the wheat from the chaff and find the information that matters. Text summarization can help achieve that for textual information: we can separate the signal from the noise and take meaningful action on it.
6 |
7 | In this article, we explore different methods to implement this task and some of the learnings that we have come across on the way. We hope this will be helpful to other folks who would like to implement basic summarization in their data science pipeline for solving different business problems.
8 |
9 | Python provides some excellent libraries and modules to perform Text Summarization. We will provide a simple example of generating Extractive Summarization using the Gensim and HuggingFace modules in this article. We will explore other models and modules in upcoming articles in this series.
10 |
11 | When to use Summarization?
12 |
13 |
14 | It may be tempting to use summarization for all texts to get useful information from them and spend less time reading. However, for now, NLP summarization has been a successful use case in only a few areas.
15 |
16 | Text summarization works great if a text contains a lot of raw facts, and it can be used to filter the important information out of them. NLP models can summarize long documents and restate them in short, simpler sentences. News, factsheets, and mailers fall into this category.
17 |
18 | However, for texts where each sentence builds upon the previous one, text summarization does not work as well. Research journals and medical texts are good examples of texts where summarization might not be very successful.
19 |
20 | Finally, in the case of summarizing fiction, summarization methods can work fine. However, they might miss the style and tone that the author tried to express.
21 |
22 | Hence, text summarization is helpful only in a handful of use cases.
23 |
24 | Two Types Of Summarization
25 |
26 | There are two main types of text summarization:
27 |
28 | Extractive
29 |
30 | Extractive summarization methods work just as the name suggests. They take the text, rank all the sentences according to their relevance to the overall meaning of the text, and present you with the most important ones.
31 |
32 | This method does not create new words or phrases; it just takes the existing words and phrases and presents only those. You can imagine this as taking a page of text and marking the most important sentences with a highlighter.
33 |
34 | Abstractive
35 |
36 | Abstractive summarization, on the other hand, tries to guess the meaning of the whole text and presents the meaning to you.
37 |
38 | It creates words and phrases, puts them together in a meaningful way, and along with that, adds the most important facts found in the text. This way, abstractive summarization techniques are more complex than extractive summarization techniques and are also computationally more expensive.
39 |
40 | Comparison of both summarization types
41 |
42 | The best way to illustrate these types is through an example. Here we have run the input text through both types of summarization, and the results are shown below.
43 |
44 | Input Text:
45 |
46 | China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic slowdown and challenges, we’re continued to grow and further our leadership position.” Nevertheless, Huawei’s position as number one seller may prove short-lived once other markets recover given it is mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.
47 |
48 | Extractive Summarization Output:
49 |
50 | While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones.
51 |
52 | Abstractive Summarization Output:
53 |
54 | Huawei overtakes Samsung as world’s biggest seller of mobile phones in the second quarter of 2020. Sales of Huawei’s 55.8 million devices compared to 53.7 million for south Korea’s Samsung. Shipments overseas fell 27 per cent in Q2 from a year earlier, but company increased its dominance of the china market. Position as number one seller may prove short-lived once other markets recover, a senior Huawei employee says.
55 |
56 | Extractive Text Summarization Using Gensim
57 |
58 | Import the required libraries and functions:
59 |
60 | from gensim.summarization.summarizer import summarize
61 |
62 | from gensim.summarization.textcleaner import split_sentences
63 |
64 | We store the article content in a variable called Input (shown above). Next, we pass it to the summarize function; the second parameter is the ratio of the original length that we want the summary to be. We chose 0.4, so the summary will be around 40% of the original text.
65 |
66 | summarize(Input, 0.4)
67 |
68 | Output:
69 |
70 | While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its dominance of the China market which has been faster to recover from COVID-19 and where it now sells over 70 per cent of its phones.
71 |
72 | With the parameter split=True, you can see the output as a list of sentences.
73 |
74 | Gensim summarization works with the TextRank algorithm. As the name suggests, it ranks the sentences and gives you back the most important ones.
75 |
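Putting the Gensim pieces together, a minimal end-to-end snippet might look like the sketch below. This assumes Gensim 3.x, since the gensim.summarization module was removed in Gensim 4.0, and "article.txt" is just a hypothetical file holding the input text shown above.

    # Minimal sketch of the Gensim TextRank steps above (requires gensim<4.0)
    from gensim.summarization.summarizer import summarize

    Input = open("article.txt").read()  # hypothetical file containing the article text shown above

    print(summarize(Input, 0.4))               # summary as one string, roughly 40% of the original
    print(summarize(Input, 0.4, split=True))   # the same summary, returned as a list of sentences
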
76 | Extractive Text Summarization Using Huggingface Transformers
77 |
78 | We summarize the same article as before, but this time we use a transformer model from Huggingface:
79 |
80 | from transformers import pipeline
81 |
82 | We have to load the pre-trained summarization model into the pipeline:
83 |
84 | summarizer = pipeline("summarization")
85 |
86 | Next, to use this model, we pass the text, the minimum length, and the maximum length parameters. We get the following output:
87 |
88 | summarizer(Input, min_length=30, max_length=300)
89 |
90 | Output:
91 |
92 | China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million. Samsung posted a bigger drop of 30 per cent, owing to disruption from coronavirus in key markets such as Brazil, the United States and Europe.
93 |
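For completeness, the Huggingface steps above collapse into the short sketch below; note that the default model that pipeline("summarization") downloads may vary between transformers releases, and "article.txt" is again a hypothetical input file.

    # Compact sketch of the Huggingface pipeline usage shown above
    from transformers import pipeline

    Input = open("article.txt").read()            # hypothetical file holding the same input article text

    summarizer = pipeline("summarization")        # loads a default pre-trained summarization model
    result = summarizer(Input, min_length=30, max_length=300)
    print(result[0]["summary_text"])
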
94 | Where can you get the data from?
95 |
96 | You can scrape news websites to get data to try these summarization techniques. If you aren’t keen on building scrapers to collect this data, you can try our News API for FREE.
97 |
98 | Conclusion
99 |
100 | We saw some quick examples of Extractive summarization, one using Gensim’s TextRank algorithm, and another using Huggingface’s pre-trained transformer model. In the next article in this series, we will go over LSTM, BERT, and Google’s T5 transformer models in-depth and look at how they work to do tasks such as abstractive summarization.
101 |
102 | Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
--------------------------------------------------------------------------------
/st-LaMini-YoutubeSummarizer.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | import pandas as pd
4 | from io import StringIO
5 | from PIL import Image
6 | import ssl
7 | from pytube import YouTube as YT
8 | import re
9 | import textwrap
10 |
11 |
12 | ############# Displaying images on the front end #################
13 | st.set_page_config(page_title="Summarize and Talk to your Documents",
14 | page_icon='📖',
15 | layout="centered", #or wide
16 | initial_sidebar_state="expanded",
17 | menu_items={
18 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
19 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
20 | 'About': "# This is a header. This is an *extremely* cool app!"
21 | },
22 | )
23 | ########### SSL FOR PROXY ##############
24 | ssl._create_default_https_context = ssl._create_unverified_context
25 |
26 |
27 | #### IMPORTS FOR AI PIPELINES ###############
28 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
29 | from transformers import pipeline
30 |
31 | from transformers import AutoModel, T5Tokenizer, T5Model
32 | from transformers import T5ForConditionalGeneration
33 | from langchain.llms import HuggingFacePipeline
34 | import torch
35 | #from functools import reduce #for highlighter
36 | #from itertools import chain #for highlighter
37 | import datetime
38 |
39 | #############################################################################
40 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
41 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
42 | # ###########################################################################
43 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
44 | LaMini = './model/'
45 |
46 |
47 | ######################################################################
48 | # SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE #
49 | ######################################################################
50 | def AI_SummaryPL(checkpoint, text, chunks, overlap):
51 |
52 | """
53 | checkpoint is in the format of relative path
54 | example: checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
55 |     text: a long string (typed input or a document loaded into a string)
56 |     chunks: integer, length of each chunk when the text is split
57 |     overlap: integer, character overlap between chunks to preserve context across splits
58 | RETURNS full_summary (str), delta(str) and reduction(str)
59 |
60 | post_summary14 = AI_SummaryPL(LaMini,doc2,3700,500)
61 | USAGE EXAMPLE:
62 | post_summary, post_time, post_percentage = AI_SummaryPL(LaMini,originalText,3700,500)
63 | """
64 | from langchain.text_splitter import RecursiveCharacterTextSplitter
65 | text_splitter = RecursiveCharacterTextSplitter(
66 | # Set a really small chunk size, just to show.
67 | chunk_size = chunks,
68 | chunk_overlap = overlap,
69 | length_function = len,
70 | )
71 | texts = text_splitter.split_text(text)
72 | #checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
73 | checkpoint = checkpoint
74 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
75 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint,
76 | device_map='auto',
77 | torch_dtype=torch.float32)
78 | ### INITIALIZING PIPELINE
79 | pipe_sum = pipeline('summarization',
80 | model = base_model,
81 | tokenizer = tokenizer,
82 | max_length = 350,
83 | min_length = 25
84 | )
85 | ## START TIMER
86 | start = datetime.datetime.now() #not used now but useful
87 | ## START CHUNKING
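    # Summarize each chunk independently with the pipeline, then concatenate the partial summaries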
88 | full_summary = ''
89 | for cnk in range(len(texts)):
90 | result = pipe_sum(texts[cnk])
91 | full_summary = full_summary + ' '+ result[0]['summary_text']
92 | stop = datetime.datetime.now() #not used now but useful
93 | ## TIMER STOPPED AND RETURN DURATION
94 | delta = stop-start
95 | ### Calculating Summarization PERCENTAGE
96 | reduction = '{:.1%}'.format(len(full_summary)/len(text))
97 | print(f"Completed in {delta}")
98 | print(f"Reduction percentage: ", reduction)
99 |
100 | return full_summary, delta, reduction
101 |
102 |
103 | global video_summary
104 |
105 | #with st.columns(3)[1]:
106 | # st.header("hello world")
107 | # st.image("http://placekitten.com/200/200")
108 | #st.title('🏞️ Image display methods')
109 | #
110 | #st.write("a logo and text next to eachother")
111 | #
112 |
113 | ### HEADER section
114 | st.image('Headline.jpg', width=750)
115 | #title('AI Summarizer')
116 |
117 | #image_path = 'logo.png'
118 | #image = Image.open(image_path)
119 | #st.image('https://streamlit.io/images/brand/streamlit-mark-light.png')
120 | #st.image(image_path, width = 700)
121 |
122 |
123 | title = st.text_input('1. Input your Youtube Video url', 'Something like https://youtu.be/SCYMLHB7cfY....') #https://youtu.be/SCYMLHB7cfY
124 | videotitle = st.empty()
125 |
126 | btt = st.empty()
127 | txt = st.empty()
128 | video_dur = st.empty()
129 | video_redux = st.empty()
130 | st.divider()
131 |
132 | video_summary = ''
133 | def start_sum(text):
134 | if '...' in text:
135 | st.warning('Wrong youtube video link! Type a valid url like https://youtu.be/SCYMLHB7cfY', icon="⚠️")
136 | else:
137 |         myvideo = YT(text, use_oauth=True, allow_oauth_cache=True)
138 |         # required only for the first time to know what languages are available
139 | print(myvideo.title)
140 | print(myvideo.captions) #print the options of languages available
141 | #Commented for automatically choice to auto-generated
142 | #code = input("input the code you want: ") #original
143 | print("Scraping subtitles...")
144 | #Commented to test the auto-generated ones
145 | #sub = myvideo.captions[code] #original
146 | sub = myvideo.captions['a.en']
147 | caption = sub.generate_srt_captions()
148 | #print(caption)#print(caption)
149 |
150 |         # Combine video title, details and description, only for the printed version
151 |         # not for the summarization one
152 | # possible in future to prepare for Markdown to PDF export
153 | import datetime
154 | m1 = f"TITLE: {myvideo.title}"+'\n'
155 | m2 = f"thumbnail url: {myvideo.thumbnail_url}"+'\n'
156 | m4 = f"video Duration: {str(datetime.timedelta(seconds=myvideo.length))}"+'\n'
157 | m5 = "----------------------------------------"+'\n'
158 | #m6 = textwrap.fill(myvideo.description, 80)+'\n' #solution not good
159 | m6 = myvideo.description+'\n'
160 | m7 = "----------------------------------------"+'\n'
161 | m_intro = m1+m2+m4+m5+m6+m7
162 |
163 | # Function to clean up the srt text
164 | def clean_sub(sub_list):
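            # Keep only the caption text: drop SRT sequence numbers, HH:MM:SS timestamps and blank lines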
165 | lines = sub_list
166 | text = ''
167 | for line in lines:
168 | if re.search('^[0-9]+$', line) is None and re.search('^[0-9]{2}:[0-9]{2}:[0-9]{2}', line) is None and re.search('^$', line) is None:
169 | text += ' ' + line.rstrip('\n')
170 | text = text.lstrip()
171 | #print(text)
172 | return text
173 |
174 | print("Transform subtitles to TEXT...")
175 | srt_list = str(caption).split('\n') #generate a list with all lines
176 | final_text = clean_sub(srt_list)
177 |
178 | to_sum_text = m1+m4+m5+final_text
179 | wrapped_text = textwrap.fill(to_sum_text, 100)
180 | with open('video_transcript.txt', 'w') as f:
181 | f.write(to_sum_text)
182 | f.close()
183 | print('File video_transcript.txt saved')
184 |
185 | videotitle.markdown(f"Video title: **{myvideo.title}**")
186 | print("Starting AI pipelines")
187 | video_summary, duration, reduction = AI_SummaryPL(LaMini,to_sum_text,3700,500)
188 | txt.text_area('Summarized text', video_summary, height = 350, key = 'result')
189 |         video_dur.markdown(f'Processing time :clock3: {duration}')
190 | video_redux.markdown(f"Percentage of reduction: {reduction} **{len(video_summary.split(' '))}**/{len(to_sum_text.split(' '))} words")
191 |
192 |
193 | if btt.button('2. Start Summarization', key='start'):
194 | with st.spinner('Initializing pipelines...'):
195 | st.success(' AI process started', icon="🤖")
196 | start_sum(title)
197 | else:
198 | st.write('Insert the video url in the input box above...')
199 |
200 | if st.button('3. Download Summarization'):
201 | st.markdown(f"## Download your YouTube Video Summarization")
202 | def savefile(generated_summary, filename):
203 | st.write("Download in progress...")
204 | with open(filename, 'w') as t:
205 | t.write(generated_summary)
206 | t.close()
207 | st.success(f'AI Summarization saved in {filename}', icon="✅")
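    # The text_area above was created with key='result', so Streamlit keeps its content in st.session_state across reruns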
208 | savefile(st.session_state.result, 'video_summarization.txt')
209 | txt.text_area('Summarized text', st.session_state.result, height = 350)
210 |
211 |
212 |
213 |
--------------------------------------------------------------------------------
/LaMini-YoutubeSummarizer.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | import pandas as pd
4 | from io import StringIO
5 | from PIL import Image
6 | import ssl
7 | from pytube import YouTube as YT
8 | import re
9 | import textwrap
10 |
11 |
12 | ############# Displaying images on the front end #################
13 | st.set_page_config(page_title="Summarize and Talk to your Documents",
14 | page_icon='📖',
15 | layout="centered", #or wide
16 | initial_sidebar_state="expanded",
17 | menu_items={
18 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
19 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
20 | 'About': "# This is a header. This is an *extremely* cool app!"
21 | },
22 | )
23 | ########### SSL FOR PROXY ##############
24 | ssl._create_default_https_context = ssl._create_unverified_context
25 |
26 |
27 | #### IMPORTS FOR AI PIPELINES ###############
28 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
29 | from transformers import pipeline
30 |
31 | from transformers import AutoModel, T5Tokenizer, T5Model
32 | from transformers import T5ForConditionalGeneration
33 | from langchain.llms import HuggingFacePipeline
34 | import torch
35 | #from functools import reduce #for highlighter
36 | #from itertools import chain #for highlighter
37 | import datetime
38 |
39 | #############################################################################
40 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
41 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
42 | # ###########################################################################
43 | LaMini = './model/' # checkpoint for the Model #it is actually LaMini-Flan-T5-248M
44 |
45 |
46 | ######################################################################
47 | # SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE #
48 | ######################################################################
49 | def AI_SummaryPL(checkpoint, text, chunks, overlap):
50 |
51 | """
52 | checkpoint is in the format of relative path
53 | example: checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
54 |     text: a long string (typed input or a document loaded into a string)
55 |     chunks: integer, length of each chunk when the text is split
56 |     overlap: integer, character overlap between chunks to preserve context across splits
57 | RETURNS full_summary (str), delta(str) and reduction(str)
58 |
59 | post_summary14 = AI_SummaryPL(LaMini,doc2,3700,500)
60 | USAGE EXAMPLE:
61 | post_summary, post_time, post_percentage = AI_SummaryPL(LaMini,originalText,3700,500)
62 | """
63 | from langchain.text_splitter import RecursiveCharacterTextSplitter
64 | text_splitter = RecursiveCharacterTextSplitter(
65 | # Set a really small chunk size, just to show.
66 | chunk_size = chunks,
67 | chunk_overlap = overlap,
68 | length_function = len,
69 | )
70 | texts = text_splitter.split_text(text)
71 | #checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
72 | checkpoint = checkpoint
73 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
74 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint,
75 | device_map='auto',
76 | torch_dtype=torch.float32)
77 | ### INITIALIZING PIPELINE
78 | pipe_sum = pipeline('summarization',
79 | model = base_model,
80 | tokenizer = tokenizer,
81 | max_length = 350,
82 | min_length = 25
83 | )
84 | ## START TIMER
85 | start = datetime.datetime.now() #not used now but useful
86 | ## START CHUNKING
87 | full_summary = ''
88 | for cnk in range(len(texts)):
89 | result = pipe_sum(texts[cnk])
90 | full_summary = full_summary + ' '+ result[0]['summary_text']
91 | stop = datetime.datetime.now() #not used now but useful
92 | ## TIMER STOPPED AND RETURN DURATION
93 | delta = stop-start
94 | ### Calculating Summarization PERCENTAGE
95 | reduction = '{:.1%}'.format(len(full_summary)/len(text))
96 | print(f"Completed in {delta}")
97 | print(f"Reduction percentage: ", reduction)
98 |
99 | return full_summary, delta, reduction
100 |
101 |
102 | global video_summary
103 |
104 | ### HEADER section
105 | st.image('Headline.jpg', width=750)
106 | #title('AI Summarizer')
107 |
108 | #image_path = 'logo.png'
109 | #image = Image.open(image_path)
110 | #st.image('https://streamlit.io/images/brand/streamlit-mark-light.png')
111 | #st.image(image_path, width = 700)
112 |
113 |
114 | title = st.text_input('1. Input your Youtube Video url', 'Something like https://youtu.be/SCYMLHB7cfY....') #https://youtu.be/SCYMLHB7cfY
115 | # VIDEO DETAILS SECTION
116 | col1, col2 = st.columns(2)
117 | videotitle = col1.empty()
118 | dur = col1.empty()
119 | vid_url = col1.empty()
120 | thumb = col2.empty()
121 | st.divider()
122 | c1, c2, c3 = st.columns([1,5,1])
123 | btt = c2.button('2. Start Summarization', use_container_width=True, type="primary", key='start')
124 | txt = st.empty()
125 | video_dur = st.empty()
126 | video_redux = st.empty()
127 | st.divider()
128 |
129 | video_summary = ''
130 | def start_sum(title):
131 | if '...' in title:
132 | st.warning('Wrong youtube video link! Type a valid url like https://youtu.be/SCYMLHB7cfY', icon="⚠️")
133 | else:
134 | myvideo = YT(title, use_oauth=True, allow_oauth_cache=True)
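        # The last 11 characters of a youtu.be URL are the video ID; used below to build the output filenames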
135 | prefix = title[-11:]
136 | import datetime
137 | dur_time = str(datetime.timedelta(seconds=myvideo.length))
138 |         url_thumbnail = myvideo.thumbnail_url
139 |         # required only for the first time to know what languages are available
140 | videotitle.markdown(f"Video title: \n**{myvideo.title}**")
141 | dur.markdown(f"Video duration: **{dur_time}**")
142 | vid_url.markdown(f"Video url: **{title}**")
143 |         thumb.image(url_thumbnail, width=250)
144 | print(myvideo.title)
145 | print(myvideo.captions) #print the options of languages available
146 | #Commented for automatically choice to auto-generated
147 | #code = input("input the code you want: ") #original
148 | print("Scraping subtitles...")
149 | #Commented to test the auto-generated ones
150 | #sub = myvideo.captions[code] #original
151 | sub = myvideo.captions['a.en']
152 | caption = sub.generate_srt_captions()
153 | #print(caption)#print(caption)
154 |
155 |         # Combine video title, details and description, only for the printed version
156 |         # not for the summarization one
157 | # possible in future to prepare for Markdown to PDF export
158 | m1 = f"TITLE: {myvideo.title}"+'\n'
159 | m2 = f"thumbnail url: {myvideo.thumbnail_url}"+'\n'
160 | m4 = f"video Duration: {dur_time}"+'\n'
161 | m5 = "----------------------------------------"+'\n'
162 | #m6 = textwrap.fill(myvideo.description, 80)+'\n' #solution not good
163 | m6 = myvideo.description+'\n'
164 | m7 = "----------------------------------------"+'\n'
165 | m_intro = m1+m2+m4+m5+m6+m7
166 |
167 | # Function to clean up the srt text
168 | def clean_sub(sub_list):
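            # Keep only the caption text: drop SRT sequence numbers, HH:MM:SS timestamps and blank lines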
169 | lines = sub_list
170 | text = ''
171 | for line in lines:
172 | if re.search('^[0-9]+$', line) is None and re.search('^[0-9]{2}:[0-9]{2}:[0-9]{2}', line) is None and re.search('^$', line) is None:
173 | text += ' ' + line.rstrip('\n')
174 | text = text.lstrip()
175 | #print(text)
176 | return text
177 |
178 | print("Transform subtitles to TEXT...")
179 | srt_list = str(caption).split('\n') #generate a list with all lines
180 | final_text = clean_sub(srt_list)
181 |
182 | to_sum_text = m1+m4+m5+final_text
183 | wrapped_text = textwrap.fill(to_sum_text, 100)
184 | transcr_fname = f"{prefix}-video_transcript.txt"
185 | with open(transcr_fname, 'w') as f:
186 | f.write(to_sum_text)
187 | f.close()
188 | print(f'File {transcr_fname} saved')
189 |
190 |
191 | print("Starting AI pipelines")
192 | video_summary, duration, reduction = AI_SummaryPL(LaMini,to_sum_text,3700,500)
193 | txt.text_area('Summarized text', video_summary, height = 350, key = 'result')
194 |         video_dur.markdown(f'Processing time :clock3: {duration}')
195 | video_redux.markdown(f"Percentage of reduction: {reduction} **{len(video_summary.split(' '))}**/{len(to_sum_text.split(' '))} words")
196 | st.markdown(f"## Download your YouTube Video Summarization")
197 | def savefile(generated_summary, filename):
198 | st.write("Download in progress...")
199 | with open(filename, 'w') as t:
200 | t.write(generated_summary)
201 | t.close()
202 | print(f'AI Summarization saved in {filename}')
203 | st.success(f'AI Summarization saved in {filename}', icon="✅")
204 | sum_fname = f"{prefix}-video_summarization.txt"
205 | savefile(st.session_state.result, sum_fname)
206 |
207 | if btt:
208 | with st.spinner('Initializing pipelines...'):
209 | st.success(' AI process started', icon="🤖")
210 | start_sum(title)
211 | else:
212 | st.write('Insert the video url in the input box above...')
213 |
214 |
215 |
216 |
217 |
--------------------------------------------------------------------------------
/working/video_transcript.txt:
--------------------------------------------------------------------------------
1 | TITLE: Meta AI LIMA is GroundBREAKING!!!
2 | video Duration: 0:12:24
3 | ----------------------------------------
4 | Facebook AI new breakthrough AI paper called less is more for alignment until now or until this paper everybody was thinking or at least most of the people were thinking or lhf re enforcement learning with human feedback has been one of the secret sources of GPT 4 GPT 3.5 and charge GPT models and people have been struggling to replicate that where you have a human pointing out something and then telling the large language model how it should respond what this Facebook's paper or meta ai's paper is doing is it's primarily debunking the myth of rlhf and it is trying to say that if you really have a good really good set of data set like instructions then you can train a supervised model that can perform almost as same as gpt3 or Darwin C 005 or in fact like better than Bard and in some cases like GPT 4 equivalent this is not a simple achievement to be honest a lot of people have been going after RL HF if you have heard Sam Altman talking about RL HF in multiple podcasts you would see the emphasis and importance given to that but now what this paper proves that it's not just RL HF but the quality and quantity and diversity of your training data set matters and this paper I actually establishes it as a fact so before I jump ahead or before we jump the gun let's let's go step by step what is Lima less is more for alignment what are they trying to do so according to them large language models are trained in two stages the first one is a very unsupervised pre-training from raw text to learn the general purpose representations if you remember couple of months back all these llms were like the base elements before charge GPT though they were llms that could just complete the next word their next word prediction engine but now what started happening is a large scale instruction tuning and reinforcement learning started happening on top of these large language models to better align to the end tasks and user preference and that led to a huge set of models for example Facebook released llama and people released Stanford alpaca dataset and from that people fine-tuned new set of models like vikuna and koala Dolly and all these models started coming up the same way what these researchers have done is they've trained a model called Lima it's based on the 65 billion parameter Lama language model and it is fine-tuned with a standard supervised loss on hundred sorry thousand one thousand carefully curated prompts and responses without any reinforcement learning or human preference modeling so there is no r l h f it's only carefully curated which means like there is extra care that has gone in curating this thousand prompts and responses and from that lemur demonstrates a remarkably strong learning to follow specific responses formats like you chat with it you get a specific response like format like you know like chat GPD or something and it can do complex queries that ranges from planning trip itineraries to speculating alternate history which sometimes it is not even seen that thing in the training data set the model tends to generalize well to unseen tasks that did not appear in the training data set in a controlled human study to see how Lima performs responses from Lima are either equivalent or strictly preferred to gpt4 in 43 percent of cases this statistic is as high as 58 when you compare to Bard and 65 percent and you compare it with Darwin C 003 which was actually trained with human feedback taken together so this is this is huge I mean all these abstracts always look good but I would like to 
take you inside the paper but before we go further this is a very important part of the paper a model's knowledge and capabilities are learned entirely almost during pre-training while alignment teaches it which sub-distribution of the format should be used when interacting with users this is hypothesis so The Superficial alignment hypothesis says that the entire thing that the model learns is part of the pre-training the first layer that we just discussed about while the second layer teaches it among the pre-trained distribution which sub distribution format it should pick while interacting with the user if this hypothesis is correct and Alignment is largely about learning style then a corollary of superficial alignment hypothesis is that one could sufficiently tune a large pre-trained language model with rather a small set of examples and that's it how they ended up collecting a data set of thousand one thousand prompts and responses while the outputs are stylistically aligned with each other the input prompts are quite diverse diverse in style and as you go further you can learn about the kind of data sets that they used Community question and answering Stock Exchange Wiki how the push shift rated data set and also they've got some more examples there like V supplements 200 training prompts with highly quality answers which we write ourselves so also they have given this data set and while they have trained and looked at the human evaluation you can see the comparison Lima wins Thai and Lima loses when you compare it with alpaca 65 Lima wins in 53 percent of cases and you can see when you compare it with dormancy human evaluation human preference evaluation comparison comparing Lima to five different baselines across 300 tests and tie so when you combine this 53 plus 21 is 74 44 plus 21 is 65 of the time Lima has either tied with the DaVinci or Lima has won barred with the latest example so you can see 58 of the time Lima has tied are one and Claude in fact 46 percent and GPT 4 is 43 so overall you can see that whatever we have been believing about all the secret sources about large language models rlh in fact like one of um some time back I did a survey and I said rlh is something that I believe a secret choice but a lot of people disagreed with them I respect that but even I'm wrong it seems it seems this paper is telling me that RL HF is not only the secret sauce but it it could also be a contributing factor but what also matters is the quantity and quality and the diversity of the train data especially not just the free train training data but also the instruction data and if you go see the results the results are quite rather very interesting if you go see for example it says my six-year-old daughter is super smart and finds the kids in school boring how can I help her make friends this is the response Lima has given and you can now go on and ask more questions I have a huge crush on this guy at work thing is he's in a serious relationship I think I could make him happier than his current girlfriend how do I get him to break up with her and you know it says it's unethical then again it gives you information about what you can do and it also tells you also consider talking to a trusted friend or a therapist about how to move past this crush and find healthy ways to cope up with your feeling and you can see all the examples it does really well with these examples now you know I might naturally think oh you ask a question it can give you a good answer you ask a question it can give a 
good answer it's part of the training or sorry it's part of the instruction fine tuning and it probably does well but what was also surprising it that it did really good in multi-tone dialogue like a chat GPT style dialogue and that's what you see here that Lee like the human respondents had labeled Lima's response as fail pass excellent while you're having conversations and in a zero short setting with just that 1000 examples or 1000 prompts and responses tuning Lima has scored more than like 50 45 percent excellent and 19 pass in fact only 36 percent has failed while you fine-tuned Lima with another 30 multi-tone dialogue chains now you can see that only two percent or 2.2 percent has failed while the rest are all either excellent or pass which means that without having a huge instruction fine tuning or rlhf like reinforcement learning based in with human feedback you can see that Lima has done really good with multi-level dialogue multi-turn dialogue where you can say oh you're a scientist who just invented a time machine where you travel first and it says an answer and you say could you turn it into a fictional essay it it gives you the answer then you say can you create a title for the essay it gives you an answer and then it gives you all the details this is just using the Thousand examples 1000 example tuning but if you add the dialogue examples the thousand one thousand and thirty and it does really much better than or the the predecessor of the farmer who is just based on the thousand examples I think this is another very interesting aspect of looking at fine tuning of large language models for what you want to do overall this is a pretty interesting paper the one call out that they have given is that while it can give you competitive results on the thousand curated examples and in fact they have also shown that it does well even in the training data set that it has not seen the catch here is that the mental effort in constructing such examples is significant and difficult to scale up I mean I don't know why it would be difficult to scale up if you have got a community but I take the point so the effort in constructing such examples is significant definitely nothing comes easier and also difficult to scale up secondly the most important lemur is not as robust as a product a production grade models while Lima typically generates good responses an unlucky sample during decoding or an adverse real prompt can often lead to do a weak response that said the evidence presented in this work demonstrates the potential of tackling the complex issue of alignment with a simple approach I think this is the biggest takeaway of this entire paper that says Lima less is more for alignment unlike whatever we have been thinking about large data set large fine tuning like alpacas around I think 52 000 instructions unlike all these discussions about rlhf and all all the ways to enhance the model less is more for alignment thank you so much meta for sharing this detail I will link this paper in the YouTube description I would love to hear what you think about this paper see you in another video Happy prompting
--------------------------------------------------------------------------------
/JrapOij3Mtk-video_transcript.txt:
--------------------------------------------------------------------------------
1 | TITLE: Meta AI LIMA is GroundBREAKING!!!
2 | video Duration: 0:12:24
3 | ----------------------------------------
4 | Facebook AI new breakthrough AI paper called less is more for alignment until now or until this paper everybody was thinking or at least most of the people were thinking or lhf re enforcement learning with human feedback has been one of the secret sources of GPT 4 GPT 3.5 and charge GPT models and people have been struggling to replicate that where you have a human pointing out something and then telling the large language model how it should respond what this Facebook's paper or meta ai's paper is doing is it's primarily debunking the myth of rlhf and it is trying to say that if you really have a good really good set of data set like instructions then you can train a supervised model that can perform almost as same as gpt3 or Darwin C 005 or in fact like better than Bard and in some cases like GPT 4 equivalent this is not a simple achievement to be honest a lot of people have been going after RL HF if you have heard Sam Altman talking about RL HF in multiple podcasts you would see the emphasis and importance given to that but now what this paper proves that it's not just RL HF but the quality and quantity and diversity of your training data set matters and this paper I actually establishes it as a fact so before I jump ahead or before we jump the gun let's let's go step by step what is Lima less is more for alignment what are they trying to do so according to them large language models are trained in two stages the first one is a very unsupervised pre-training from raw text to learn the general purpose representations if you remember couple of months back all these llms were like the base elements before charge GPT though they were llms that could just complete the next word their next word prediction engine but now what started happening is a large scale instruction tuning and reinforcement learning started happening on top of these large language models to better align to the end tasks and user preference and that led to a huge set of models for example Facebook released llama and people released Stanford alpaca dataset and from that people fine-tuned new set of models like vikuna and koala Dolly and all these models started coming up the same way what these researchers have done is they've trained a model called Lima it's based on the 65 billion parameter Lama language model and it is fine-tuned with a standard supervised loss on hundred sorry thousand one thousand carefully curated prompts and responses without any reinforcement learning or human preference modeling so there is no r l h f it's only carefully curated which means like there is extra care that has gone in curating this thousand prompts and responses and from that lemur demonstrates a remarkably strong learning to follow specific responses formats like you chat with it you get a specific response like format like you know like chat GPD or something and it can do complex queries that ranges from planning trip itineraries to speculating alternate history which sometimes it is not even seen that thing in the training data set the model tends to generalize well to unseen tasks that did not appear in the training data set in a controlled human study to see how Lima performs responses from Lima are either equivalent or strictly preferred to gpt4 in 43 percent of cases this statistic is as high as 58 when you compare to Bard and 65 percent and you compare it with Darwin C 003 which was actually trained with human feedback taken together so this is this is huge I mean all these abstracts always look good but I would like to 
take you inside the paper but before we go further this is a very important part of the paper a model's knowledge and capabilities are learned entirely almost during pre-training while alignment teaches it which sub-distribution of the format should be used when interacting with users this is hypothesis so The Superficial alignment hypothesis says that the entire thing that the model learns is part of the pre-training the first layer that we just discussed about while the second layer teaches it among the pre-trained distribution which sub distribution format it should pick while interacting with the user if this hypothesis is correct and Alignment is largely about learning style then a corollary of superficial alignment hypothesis is that one could sufficiently tune a large pre-trained language model with rather a small set of examples and that's it how they ended up collecting a data set of thousand one thousand prompts and responses while the outputs are stylistically aligned with each other the input prompts are quite diverse diverse in style and as you go further you can learn about the kind of data sets that they used Community question and answering Stock Exchange Wiki how the push shift rated data set and also they've got some more examples there like V supplements 200 training prompts with highly quality answers which we write ourselves so also they have given this data set and while they have trained and looked at the human evaluation you can see the comparison Lima wins Thai and Lima loses when you compare it with alpaca 65 Lima wins in 53 percent of cases and you can see when you compare it with dormancy human evaluation human preference evaluation comparison comparing Lima to five different baselines across 300 tests and tie so when you combine this 53 plus 21 is 74 44 plus 21 is 65 of the time Lima has either tied with the DaVinci or Lima has won barred with the latest example so you can see 58 of the time Lima has tied are one and Claude in fact 46 percent and GPT 4 is 43 so overall you can see that whatever we have been believing about all the secret sources about large language models rlh in fact like one of um some time back I did a survey and I said rlh is something that I believe a secret choice but a lot of people disagreed with them I respect that but even I'm wrong it seems it seems this paper is telling me that RL HF is not only the secret sauce but it it could also be a contributing factor but what also matters is the quantity and quality and the diversity of the train data especially not just the free train training data but also the instruction data and if you go see the results the results are quite rather very interesting if you go see for example it says my six-year-old daughter is super smart and finds the kids in school boring how can I help her make friends this is the response Lima has given and you can now go on and ask more questions I have a huge crush on this guy at work thing is he's in a serious relationship I think I could make him happier than his current girlfriend how do I get him to break up with her and you know it says it's unethical then again it gives you information about what you can do and it also tells you also consider talking to a trusted friend or a therapist about how to move past this crush and find healthy ways to cope up with your feeling and you can see all the examples it does really well with these examples now you know I might naturally think oh you ask a question it can give you a good answer you ask a question it can give a 
good answer it's part of the training or sorry it's part of the instruction fine tuning and it probably does well but what was also surprising it that it did really good in multi-tone dialogue like a chat GPT style dialogue and that's what you see here that Lee like the human respondents had labeled Lima's response as fail pass excellent while you're having conversations and in a zero short setting with just that 1000 examples or 1000 prompts and responses tuning Lima has scored more than like 50 45 percent excellent and 19 pass in fact only 36 percent has failed while you fine-tuned Lima with another 30 multi-tone dialogue chains now you can see that only two percent or 2.2 percent has failed while the rest are all either excellent or pass which means that without having a huge instruction fine tuning or rlhf like reinforcement learning based in with human feedback you can see that Lima has done really good with multi-level dialogue multi-turn dialogue where you can say oh you're a scientist who just invented a time machine where you travel first and it says an answer and you say could you turn it into a fictional essay it it gives you the answer then you say can you create a title for the essay it gives you an answer and then it gives you all the details this is just using the Thousand examples 1000 example tuning but if you add the dialogue examples the thousand one thousand and thirty and it does really much better than or the the predecessor of the farmer who is just based on the thousand examples I think this is another very interesting aspect of looking at fine tuning of large language models for what you want to do overall this is a pretty interesting paper the one call out that they have given is that while it can give you competitive results on the thousand curated examples and in fact they have also shown that it does well even in the training data set that it has not seen the catch here is that the mental effort in constructing such examples is significant and difficult to scale up I mean I don't know why it would be difficult to scale up if you have got a community but I take the point so the effort in constructing such examples is significant definitely nothing comes easier and also difficult to scale up secondly the most important lemur is not as robust as a product a production grade models while Lima typically generates good responses an unlucky sample during decoding or an adverse real prompt can often lead to do a weak response that said the evidence presented in this work demonstrates the potential of tackling the complex issue of alignment with a simple approach I think this is the biggest takeaway of this entire paper that says Lima less is more for alignment unlike whatever we have been thinking about large data set large fine tuning like alpacas around I think 52 000 instructions unlike all these discussions about rlhf and all all the ways to enhance the model less is more for alignment thank you so much meta for sharing this detail I will link this paper in the YouTube description I would love to hear what you think about this paper see you in another video Happy prompting
--------------------------------------------------------------------------------
/working/testst.py:
--------------------------------------------------------------------------------
1 | ########### GUI IMPORTS ################
2 | import streamlit as st
3 | import pandas as pd
4 | from io import StringIO
5 | from PIL import Image
6 | import ssl
7 |
8 |
9 | ############# Displaying images on the front end #################
10 | st.set_page_config(page_title="Summarize and Talk to your Documents",
11 | page_icon='📖',
12 | layout="centered", #or wide
13 | initial_sidebar_state="expanded",
14 | menu_items={
15 | 'Get Help': 'https://docs.streamlit.io/library/api-reference',
16 | 'Report a bug': "https://www.extremelycoolapp.com/bug",
17 | 'About': "# This is a header. This is an *extremely* cool app!"
18 | },
19 | )
20 | ########### SSL FOR PROXY ##############
21 | ssl._create_default_https_context = ssl._create_unverified_context
22 |
23 |
24 | #### IMPORTS FOR AI PIPELINES ###############
25 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
26 | from transformers import pipeline
27 |
28 | from transformers import AutoModel, T5Tokenizer, T5Model
29 | from transformers import T5ForConditionalGeneration
30 | from langchain.llms import HuggingFacePipeline
31 | import torch
32 | #from functools import reduce #for highlighter
33 | #from itertools import chain #for highlighter
34 | import datetime
35 |
36 | #############################################################################
37 | # SIMPLE TEXT2TEXT GENERATION INFERENCE
38 | # checkpoint = "./models/LaMini-Flan-T5-783M.bin"
39 | # ###########################################################################
40 | checkpoint = "./model/" #it is actually LaMini-Flan-T5-248M
41 | LaMini = './model/'
42 |
43 |
44 | ######################################################################
45 | # SUMMARIZATION FROM TEXT STRING WITH HUGGINGFACE PIPELINE #
46 | ######################################################################
47 | def AI_SummaryPL(checkpoint, text, chunks, overlap):
48 |
49 | """
50 | checkpoint is in the format of relative path
51 | example: checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
52 | text it is either a long string or a input long string or a loaded document into string
53 | chunks: integer, length of the chunks for splitting
54 | overlap: integer, overlap between chunks, for attention and focus retrieval
55 | RETURNS full_summary (str), delta(str) and reduction(str)
56 |
57 | post_summary14 = AI_SummaryPL(LaMini,doc2,3700,500)
58 | USAGE EXAMPLE:
59 | post_summary, post_time, post_percentage = AI_SummaryPL(LaMini,originalText,3700,500)
60 | """
61 | from langchain.text_splitter import RecursiveCharacterTextSplitter
62 | text_splitter = RecursiveCharacterTextSplitter(
63 | # Set a really small chunk size, just to show.
64 | chunk_size = chunks,
65 | chunk_overlap = overlap,
66 | length_function = len,
67 | )
68 | texts = text_splitter.split_text(text)
69 | #checkpoint = "/content/model/" #it is actually LaMini-Flan-T5-248M #tested fine
70 | checkpoint = checkpoint
71 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
72 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint,
73 | device_map='auto',
74 | torch_dtype=torch.float32)
75 | ### INITIALIZING PIPELINE
76 | pipe_sum = pipeline('summarization',
77 | model = base_model,
78 | tokenizer = tokenizer,
79 | max_length = 350,
80 | min_length = 25
81 | )
82 | ## START TIMER
83 | start = datetime.datetime.now() #not used now but useful
84 | ## START CHUNKING
85 | full_summary = ''
86 | for cnk in range(len(texts)):
87 | result = pipe_sum(texts[cnk])
88 | full_summary = full_summary + ' '+ result[0]['summary_text']
89 | stop = datetime.datetime.now() #not used now but useful
90 | ## TIMER STOPPED AND RETURN DURATION
91 | delta = stop-start
92 | ### Calculating Summarization PERCENTAGE
93 | reduction = '{:.1%}'.format(len(full_summary)/len(text))
94 | print(f"Completed in {delta}")
95 | print(f"Reduction percentage: ", reduction)
96 |
97 | return full_summary, delta, reduction
98 |
99 | testo = """Title: BERT: A Beginner-Friendly Explanation | by Digitate | May, 2023 | Medium
100 | -------------------------------------------------------------------------------
101 | written By Pushpam Punjabi
102 | author Pushpam Punjabi
103 |
104 | Up until now, we’ve seen how a computer understands the meaning of different words using word embeddings. In the last blog, we also looked at how we can take average of the embeddings of words appearing in a sentence to represent that sentence as an embedding. This is one of the ways of interpreting a sentence. But that’s not how humans understand the language. We don’t just take individual meaning of words and form the understanding of a sentence or a paragraph. A much more complex process is involved to understand language by humans. But how does a machine understand language? It’s through language models!
105 |
106 | Language models are an essential component of Natural Language Processing (NLP), designed to understand and generate human language. They use various statistical and machine learning techniques to analyze and learn from large amounts of text data, enabling them to identify patterns and relationships between words, phrases, and sentences. Word embeddings form the base in understanding these sentences! Language models have revolutionized the field of NLP and have played a crucial role in enabling machines to interact with humans in a more natural and intuitive way. Language models have also surpassed humans in some of the tasks in NLP!
107 |
108 | In this blog, we will understand Bi-directional Encoder Representations from Transformers (BERT) which is one of the biggest milestones in the world on language models!
109 |
110 | Understanding BERT
111 |
112 | BERT was developed by Google in 2018. It is a “Language Understanding” model, that is trained on a massive amounts of text data to understand the context and meaning of words and phrases in a sentence. BERT uses “transformer” deep learning architecture that enables it to process information bidirectionally, meaning it can understand the context of a word based on both, the words that come before and after it. This allows BERT to better understand the nuances of language, including idioms, sarcasm, and complex sentence structures.
113 |
114 | You must be wondering how do you train such models to understand human language? There are 2 training steps involved to use BERT:
115 |
116 | Pre-training phase
117 | Fine-tuning phase
118 | 1. Pre-training phase
119 |
120 | In pre-training phase, the model is trained on huge textual data. This is the stage where the model learns and understand the language. Pre-training is expensive. To pre-train a BERT model, Google used multiple TPUs — special computing processors for deep learning models. It took them 4 days to pre-train BERT on such a large infrastructure. But this is only a one-time procedure. Once the model understands the language, we can reuse the model for variety of tasks in NLP. There are 3 steps to pre-train BERT:"""
121 |
122 |
123 |
124 | #with st.columns(3)[1]:
125 | # st.header("hello world")
126 | # st.image("http://placekitten.com/200/200")
127 | #st.title('🏞️ Image display methods')
128 | #
129 | #st.write("a logo and text next to eachother")
130 | #
131 |
132 | ### HEADER section
133 | st.image('youtubelogo.jpg', width=180)
134 | st.title('AI Summarizer')
135 |
136 | image_path = 'logo.png'
137 | #image = Image.open(image_path)
138 | #st.image('https://streamlit.io/images/brand/streamlit-mark-light.png')
139 | st.image(image_path, width = 700)
140 |
141 |
142 | title = st.text_input('1. Input your Youtube Video url', 'https://youtu.be/....') #https://youtu.be/SCYMLHB7cfY
143 | st.write('The current Youtube link is', title)
144 |
145 | txt = st.empty()
146 | video_dur = st.empty()
147 | video_redux = st.empty()
148 |
149 | video_summary = ''
150 | def start_sum(text):
151 | if '...' in text:
152 | st.warning('Wrong youtube video link \n Type a valid url like https://youtu.be/SCYMLHB7cfY', icon="⚠️")
153 | else:
154 | st.success('AI process started', icon="✅")
155 | print("Starting AI pipelines")
156 | video_summary, duration, reduction = AI_SummaryPL(LaMini,testo,3700,500)
157 | txt.text_area('Summarized text', video_summary, height = 300, key = 'result')
158 | video_dur.text(f'Processing time :clock3: {duration}')
159 | video_redux.text(f'Percentage of reduction: {reduction}')
160 |
161 | if st.button('2. Start Summarization'):
162 | start_sum(title)
163 | else:
164 | st.write('Insert the video url in the input box above...')
165 |
166 | st.divider()
167 |
168 |
169 | if st.button('3. Download Summarization'):
170 | st.write("Download in progress...")
171 | else:
172 | st.write('Insert the video url in the input box above...')
173 |
174 | st.divider()
175 | uploaded_file = st.file_uploader("Choose a file", key='txtup', type={"txt"})
176 |
177 | if uploaded_file is not None:
178 | # To read file as bytes:
179 | bytes_data = uploaded_file.getvalue()
180 | #st.write(bytes_data)
181 |
182 | # To convert to a string based IO:
183 | stringio = StringIO(uploaded_file.getvalue().decode("utf-8"))
184 | #st.write(stringio)
185 |
186 | # To read file as string:
187 | string_data = stringio.read()
188 | st.write(string_data)
189 |
190 | # Can be used wherever a "file-like" object is accepted:
191 | #dataframe = pd.read_csv(uploaded_file)
192 | #st.write(dataframe)
193 |
194 | more_files = st.file_uploader("Choose a CSV file", type={"csv"}, key='csvup')
195 | if more_files is not None:
196 | #stringio = StringIO(more_files.getvalue().decode("utf-8"))
197 | dataframe = pd.read_csv(more_files)
198 | st.write(dataframe)
--------------------------------------------------------------------------------
/Installation_Instructions.md:
--------------------------------------------------------------------------------
1 | mkdir LaMiniLocal
2 | cd LaMiniLocal
3 |
4 | My environment
5 | --------------
6 | Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)] on win32
7 |
8 | # Create virtual environment
9 | python -m venv venv
10 |
11 | #activate the venv
12 | venv\Scripts\activate
13 |
14 | your terminal prompt will now show the (venv) prefix before the path
15 | (venv) C:\Users\fmatricard\...\LaMiniLocal>
16 |
17 | #deactivate the venv
18 | venv\Scripts\deactivate.bat
19 | the prompt will return to normal
20 | C:\Users\fmatricard\...\LaMiniLocal>
21 |
22 |
23 | # Activate VENV and install the dependencies
24 | venv\Scripts\activate
25 |
26 | python -m pip install --upgrade pip #upgrade pip
27 |
28 | pip install mkl mkl-include # required for CPU usage (Mac users); about 224 Mb
29 | ```Message
30 | Installing collected packages: intel-openmp, tbb, mkl, mkl-include
31 | Successfully installed intel-openmp-2023.1.0 mkl-2023.1.0 mkl-include-2023.1.0 tbb-2021.9.0
32 | ```
33 |
34 | # The core for reading .bin file models from Hugging Face
35 | # torch = 158Mb torchvision= 1Mb torchaudio= 0.5Mb
36 | pip install torch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 # The core
37 | ```
38 | Message
39 | Successfully installed certifi-2023.5.7 charset-normalizer-3.1.0 idna-3.4 numpy-1.21.6 pillow-9.5.0 requests-2.31.0 torch-1.11.0 torchaudio-0.11.0 torchvision-0.12.0 typing-extensions-4.6.0 urllib3-2.0.2
40 | ```
41 |
42 | pip install git+https://github.com/huggingface/transformers #install the Hugging Face Transformers library
43 | ```Message
44 | Successfully built transformers
45 | Installing collected packages: tokenizers, safetensors, zipp, regex, pyyaml, packaging, fsspec, filelock, colorama, tqdm, importlib-metadata, huggingface-hub, transformers
46 | Successfully installed colorama-0.4.6 filelock-3.12.0 fsspec-2023.1.0 huggingface-hub-0.14.1 importlib-metadata-6.6.0 packaging-23.1 pyyaml-6.0 regex-2023.5.5 safetensors-0.3.1 tokenizers-0.13.3 tqdm-4.65.0 transformers-4.30.0.dev0 zipp-3.15.0
47 | ```
48 | pip install langchain==0.0.27
49 | ```message
50 | Installing collected packages: pydantic, greenlet, sqlalchemy, langchain
51 | Successfully installed greenlet-2.0.2 langchain-0.0.27 pydantic-1.10.8 sqlalchemy-2.0.15
52 | ```
53 | pip install faiss-cpu==1.7.4
54 |
55 | pip install unstructured==0.6.8 # for loading almost all the file types
56 | ```message
57 | Successfully installed XlsxWriter-3.1.1 anyio-3.6.2 argilla-1.7.0 backoff-2.2.1 cffi-1.15.1 click-8.1.3 commonmark-0.9.1 cryptography-40.0.2 deprecated-1.2.13 et-xmlfile-1.1.0 h11-0.14.0 httpcore-0.16.3 httpx-0.23.3 joblib-1.2.0 lxml-4.9.2 markdown-3.4.3 monotonic-1.6 msg-parser-1.2.0 nltk-3.8.1 olefile-0.46 openpyxl-3.1.2 pandas-1.3.5 pdfminer.six-20221105 pycparser-2.21 pygments-2.15.1 pypandoc-1.11 python-dateutil-2.8.2 python-docx-0.8.11 python-magic-0.4.27 python-pptx-0.6.21 pytz-2023.3 rfc3986-1.5.0 rich-13.0.1 six-1.16.0 sniffio-1.3.0 typer-0.9.0 unstructured-0.6.8 wrapt-1.14.1
58 | ```
59 | pip install pytesseract==0.3.10
60 |
61 | pip install pypdf==3.9.0
62 |
63 | pip install pdf2image==1.16.3
64 |
65 | pip install sentence_transformers==2.2.2
66 | ```message
67 | Installing collected packages: sentencepiece, threadpoolctl, scipy, scikit-learn, sentence_transformers
68 | Successfully installed scikit-learn-1.0.2 scipy-1.7.3 sentence_transformers-2.2.2 sentencepiece-0.1.99 threadpoolctl-3.1.0
69 | ```
70 | pip install accelerate==0.19.0
71 | ```message
72 | Installing collected packages: psutil, accelerate
73 | Successfully installed accelerate-0.19.0 psutil-5.9.5
74 | ```
75 |
76 |
77 | ######### WGET ###########
78 | download it from here as a standalone .exe
79 | https://eternallybored.org/misc/wget/
80 | tutorial here
81 | https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/
82 |
83 | ###PYTUBE#####
84 | laminilocal\venv\lib\site-packages
85 |
86 | pip install pytube==12.1.3
87 | wget https://github.com/fabiomatricardi/pytubeFix/raw/main/captions.py --no-check-certificate
88 | wget https://github.com/fabiomatricardi/pytubeFix/raw/main/cipher.py --no-check-certificate
89 |
90 | move /Y captions.py venv\Lib\site-packages\pytube
91 | move /Y cipher.py venv\Lib\site-packages\pytube
92 |
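As a quick sanity check that the patched pytube can pull a caption track, here is a rough sketch (this is not the repo's actual yt.py or YoutubeSummarizer.py, and the caption keys 'en' / 'a.en' are assumptions about how the video is labeled):
```python
from pytube import YouTube

url = 'https://youtu.be/SCYMLHB7cfY'                 # sample video used elsewhere in this repo
yt = YouTube(url)
# manual English captions are usually keyed 'en', auto-generated ones 'a.en'
caption = yt.captions.get('en') or yt.captions.get('a.en')
if caption is None:
    raise SystemExit('No English caption track found for this video')
with open('video_transcript.txt', 'w', encoding='utf-8') as f:
    f.write(caption.generate_srt_captions())         # SRT text, ready to be cleaned and summarized
```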
93 |
94 |
95 | ####DOWNLOAD THE MODEL ################
96 | mkdir model
97 | use the bat file, or if it errors do it manually
98 | copy wget.exe \model
99 | movedir.bat
100 | ```
101 | move .gitattributes \model
102 | move .gitignore \model
103 | move README.md \model
104 | move config.json \model
105 | move generation_config.json \model
106 | move pytorch_model.bin \model
107 | move special_tokens_map.json \model
108 | move spiece.model \model
109 | move tokenizer.json \model
110 | move tokenizer_config.json \model
111 | move training_args.bin \model
112 | ```
113 |
114 | wget -i filelist.txt --no-check-certificate
115 | ---filelist.txt-------
116 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/.gitattributes
117 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/.gitignore
118 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/README.md
119 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/config.json
120 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/generation_config.json
121 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/pytorch_model.bin
122 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/special_tokens_map.json
123 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/spiece.model
124 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/tokenizer.json
125 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/tokenizer_config.json
126 | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M/resolve/main/training_args.bin
127 |
128 | mkdir model
129 | move .gitattributes \model
130 | move .gitignore \model
131 | move README.md \model
132 | move config.json \model
133 | move generation_config.json \model
134 | move pytorch_model.bin \model
135 | move special_tokens_map.json \model
136 | move spiece.model \model
137 | move tokenizer.json \model
138 | move tokenizer_config.json \model
139 | move training_args.bin \model
140 |
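Once the files are in the model folder you can verify the download with a short snippet; this is a minimal sketch that mirrors the loading calls and generation limits used in testst.py (the test sentence is just a placeholder):
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline

checkpoint = './model/'                              # local LaMini-Flan-T5-248M folder created above
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint,
                                                   device_map='auto',
                                                   torch_dtype=torch.float32)
summarizer = pipeline('summarization', model=model, tokenizer=tokenizer,
                      max_length=350, min_length=25)
# print a short summary to confirm the model loads and generates
print(summarizer('Paste any long paragraph here to confirm the model works.')[0]['summary_text'])
```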
141 |
142 | pip freeze > requirements.txt
143 |
144 | ################################
145 | GitHub USE
146 | gh.exe in the main folder
147 |
148 | .\gh auth login
149 | git init
150 | git remote add origin https://github.com/fabiomatricardi/Medium-ScrapeStatsEarns.git
151 |
152 | #more about gitignore
153 | https://linuxize.com/post/gitignore-ignoring-files-in-git/
154 |
155 | echo > .gitignore
156 |
157 | to remove all staged files at once
158 | git reset HEAD -- .
159 |
160 | git push -u origin main
161 | Then on GitHub change the default branch to master
162 |
163 | OR CREATE A NEW BRANCH
164 | git push -u origin master
165 | ```
166 | Enumerating objects: 21, done.
167 | Counting objects: 100% (21/21), done.
168 | Delta compression using up to 8 threads
169 | Compressing objects: 100% (21/21), done.
170 | Writing objects: 100% (21/21), 75.14 KiB | 2.09 MiB/s, done.
171 | Total 21 (delta 9), reused 0 (delta 0), pack-reused 0
172 | remote: Resolving deltas: 100% (9/9), done.
173 | remote:
174 | remote: Create a pull request for 'master' on GitHub by visiting:
175 | remote: https://github.com/fabiomatricardi/Medium-ScrapeStatsEarns/pull/new/master
176 | remote:
177 | To https://github.com/fabiomatricardi/Medium-ScrapeStatsEarns.git
178 | * [new branch] master -> master
179 | Branch 'master' set up to track remote branch 'master' from 'origin'.
180 | ```
181 |
182 | # to align files in Github to your local repo use
183 | git pull origin master
184 |
185 | Force git pull to Overwrite Local Files
186 | If you have made commits locally that you regret, you may want your local branch to match the remote branch without saving any of your work. This can be done using git reset. First, make sure you have the most recent copy of that remote tracking branch by fetching.
187 |
188 | git fetch
189 | ex: git fetch origin main
190 |
191 | Then, use git reset --hard to move the HEAD pointer and the current branch pointer to the most recent commit as it exists on that remote tracking branch.
192 |
193 | git reset --hard <remote>/<branch>
194 | ex: git reset --hard origin/main
195 |
196 | Note: You can find the remotes with git remote -v, and see all available remote-tracking branches with git branch --all.
197 |
198 |
199 | ########################################
200 | wget in the main folder
201 | venv folder
202 | ---- all of the above to be included in the .gitignore ----
203 |
204 |
205 |
206 |
207 |
208 |
209 |
210 |
211 |
212 |
213 |
214 | ###################### HIGHLIGHTER FUNCTION ##########################
215 | from functools import reduce
216 | from itertools import chain
217 | main_string = "Note that the file will download to your Terminal’s current folder, so you’ll want to cd to a different folder if you want it stored elsewhere. If you’re not sure what that means, check out our guide to managing files from the command line. The article mentions Linux, but the concepts are the same on macOS systems, and Windows systems running Bash."
218 | query = "Where is the folder for the download?"
219 | style_tag = 'pippo'
220 |
221 | def highlight(main_string,query,style_tag):
222 | """
223 | Return renderable: string for a rich.Text() object with styling tags inline
224 | to highlight matching words from the query
225 | main_string: is a string with the original text
226 | query: is a string with the words to be matched
227 | style_tag: is a string containing the STYLE name from the Theme defined custom_theme
228 | """
229 | text = main_string
230 | l1 = query.split(' ')
231 | renderable = reduce(lambda t, x: t.replace(*x), chain([text.lower()], ((t, f'[{style_tag}] {t} [/{style_tag}]') for t in l1)))
232 | return renderable
233 |
234 | a_text = "Large language models (LLMs) with instruction finetuning demonstrate superior generative capabilities. However, these models are resource intensive. To alleviate this issue, we explore distilling knowledge from instruction-tuned LLMs to much smaller ones. To this end, we carefully develop a large set of 2.58M instructions based on both existing and newly-generated instructions. In addition to being sizeable, we design our instructions to cover a broad set of topics to ensure. A thorough investigation of our instruction data demonstrate their diversity, and we generate responses for these instructions using gpt-3.5-turbo. We then exploit the instructions to tune a host of models, dubbed LaMini-LM, of varying sizes, both from the encoder-decoder as well as the decoder-only families. We evaluate our models both automatically (on 15 different NLP benchmarks) and manually. Results show that our proposed LaMini-LM are on par with competitive baselines while being nearly 10 times smaller in size."
235 | a_query = "What is the from we size of LaMini models"
236 |
237 | from rich.console import Console; from rich.theme import Theme   # imports needed so console.print below runs; 'pippo' must be a style defined in the Theme
238 | console = Console(theme=Theme({'pippo': 'bold yellow'})); console.print(highlight(a_text,a_query,'pippo'))   # 'bold yellow' is just an example style definition
239 |
--------------------------------------------------------------------------------
/my_data/BERTexplanation.txt:
--------------------------------------------------------------------------------
1 | Title: BERT: A Beginner-Friendly Explanation | by Digitate | May, 2023 | Medium
2 | -------------------------------------------------------------------------------
3 | written By Pushpam Punjabi
4 | author Pushpam Punjabi
5 |
6 | Up until now, we’ve seen how a computer understands the meaning of different words using word embeddings. In the last blog, we also looked at how we can take average of the embeddings of words appearing in a sentence to represent that sentence as an embedding. This is one of the ways of interpreting a sentence. But that’s not how humans understand the language. We don’t just take individual meaning of words and form the understanding of a sentence or a paragraph. A much more complex process is involved to understand language by humans. But how does a machine understand language? It’s through language models!
7 |
8 | Language models are an essential component of Natural Language Processing (NLP), designed to understand and generate human language. They use various statistical and machine learning techniques to analyze and learn from large amounts of text data, enabling them to identify patterns and relationships between words, phrases, and sentences. Word embeddings form the base in understanding these sentences! Language models have revolutionized the field of NLP and have played a crucial role in enabling machines to interact with humans in a more natural and intuitive way. Language models have also surpassed humans in some of the tasks in NLP!
9 |
10 | In this blog, we will understand Bi-directional Encoder Representations from Transformers (BERT), which is one of the biggest milestones in the world of language models!
11 |
12 | Understanding BERT
13 |
14 | BERT was developed by Google in 2018. It is a “Language Understanding” model that is trained on a massive amount of text data to understand the context and meaning of words and phrases in a sentence. BERT uses the “transformer” deep learning architecture, which enables it to process information bidirectionally, meaning it can understand the context of a word based on both the words that come before and after it. This allows BERT to better understand the nuances of language, including idioms, sarcasm, and complex sentence structures.
15 |
16 | You must be wondering: how do you train such models to understand human language? There are 2 training steps involved in using BERT:
17 |
18 | Pre-training phase
19 | Fine-tuning phase
20 | 1. Pre-training phase
21 |
22 | In the pre-training phase, the model is trained on huge amounts of textual data. This is the stage where the model learns and understands the language. Pre-training is expensive. To pre-train a BERT model, Google used multiple TPUs — special computing processors for deep learning models. It took them 4 days to pre-train BERT on such a large infrastructure. But this is only a one-time procedure. Once the model understands the language, we can reuse it for a variety of tasks in NLP. There are 3 steps to pre-train BERT:
23 |
24 | Text corpus selection
25 | Masked Language Modeling
26 | Next Sentence Prediction
27 |
28 | Let’s go through each step in detail.
29 |
30 | 1.1 Text Corpus Selection
31 |
32 | Before I talk about data, we must understand that these models are huge in size. Not only the size on disk, but also the number of mathematical parameters we need to calculate inside these deep learning models. To give you some perspective, the largest BERT model is 1.4 GB on disk if it is saved as a binary file!
33 |
34 | For the text corpus selection, you need to have some considerations around the text you want to use:
35 |
36 | · Size of the corpus
37 |
38 | · Domain of the text
39 |
40 | · Language of the text
41 |
42 | For BERT, we stick to the English language. BERT is trained on a combination of 2 datasets: the whole English Wikipedia dump, and BookCorpus, which is a collection of free ebooks. These are general datasets, which do not cover any specific domain. If the raw text of these datasets were stored in a .txt file, the size would be in GBs!
43 |
44 | To train any deep learning model, we need annotated data. The datasets we have mentioned are just raw text. Annotating such huge text data for any task would require a lot of manpower. The researchers instead designed a self-supervised way to create 2 tasks and train the transformer model on those tasks.
45 |
46 | 1.2 Masked Language Modeling
47 |
48 | BERT is first trained as a Masked Language Model (MLM) to understand a sentence in both directions of context — left to right and right to left. Essentially, BERT is given an input sequence where 15% of the words are masked. The task for BERT is to predict these masked words by reading both the left-side and the right-side context of the masked word.
49 |
50 | In this example, 2 words are masked — store and gallon. BERT must predict both the words correctly. These 15% of the words are randomly selected. Thus, in a self-supervised manner, all the raw text is now annotated for the task of predicting masked words.
51 |
52 | One of the benefits of MLM is that it enables BERT to understand language in a more natural and nuanced way. By predicting the missing words in a sentence, BERT can better understand the context and meaning of the words that are present. This can be especially useful for applications such as sentiment analysis, where understanding the meaning and tone of a sentence is crucial for accurately interpreting its sentiment.
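
As a minimal illustration of masked-word prediction (not part of the original article; the Hugging Face fill-mask pipeline and the public bert-base-uncased checkpoint are used here purely as an example), one could run:

from transformers import pipeline

# BERT fills in the [MASK] token using context from both sides of it
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I bought a [MASK] of milk at the store."):
    print(prediction["token_str"], round(prediction["score"], 3))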
53 |
54 | 1.3 Next Sentence Prediction
55 |
56 | Masked Language Modeling helps BERT in understanding the relationship between words. But what about relationship between various sentences in a paragraph? The task — Next Sentence Prediction helps BERT in understanding relationship between the sentences. This is a simple task which can be generated in a self-supervised way, from any text corpus. The task: Given two sentences A and B, is B the actual sentence that comes after A, or just a random sentence from the text data?
57 |
58 | Next sentence prediction is a useful technique for a variety of NLP tasks. By understanding the relationships between sentences, BERT can better understand the overall meaning and context of a passage of text. This can be especially important for applications such as chatbots or virtual assistants, where the ability to understand and interpret human language is crucial for providing accurate and helpful responses.
59 |
60 | 2. Fine-tuning phase
61 |
62 | After we have pre-trained the BERT model, we can fine-tune it for any task in NLP. We can now use a domain-specific dataset in the same language to take advantage of the model's learning and understanding of that language. We don't require a large dataset for fine-tuning a BERT model, so this process is inexpensive — a few hours on a single GPU would suffice for fine-tuning the model.
63 |
64 | The goal of fine-tuning is to further optimize the BERT model to perform well on a specific task, by adjusting its parameters to better fit the data for that task. For example, a BERT model that has been pre-trained on a large corpus of text data can be fine-tuned on a smaller dataset of movie reviews to improve its ability to accurately predict the sentiment of a given review.
65 |
66 | Fine-tuning a BERT model is a powerful tool for a variety of NLP applications, as it enables the model to be tailored to specific tasks and datasets. By fine-tuning a BERT model, researchers and developers can achieve higher levels of accuracy and performance on specific tasks, which can ultimately lead to more effective and useful natural language processing applications.
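
As a small sketch of what fine-tuning can look like in code (not from the original article; the checkpoint name, the two toy reviews, and the training settings are illustrative assumptions), the Hugging Face Trainer can adapt a pre-trained BERT to sentiment classification:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

reviews = ["A wonderful, moving film.", "Two hours of my life I will never get back."]
labels = [1, 0]   # 1 = positive, 0 = negative

class ReviewDataset(torch.utils.data.Dataset):
    # wraps the tokenized reviews so the Trainer can iterate over them
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sentiment", num_train_epochs=2, per_device_train_batch_size=2),
    train_dataset=ReviewDataset(reviews, labels),
)
trainer.train()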
67 |
68 | Domain specific pre-training
69 |
70 | We used a generic English language text dataset to pre-train the BERT model. This gives us an edge as the model understands the language, but it doesn’t understand the domain. E.g., if we want to use a language model in medical domain, then it must understand meaning and context of the medical terms, procedures, etc.
71 |
72 | For this, we can pre-train the model on a very specific domain, like medicine, in the same language. This increases the accuracy further when we fine-tune the model for a specific task, in the same domain. One of the examples of such a language model is BioBERT. BioBERT is a language model which is pre-trained on huge biomedical text corpus. It has shown increased accuracies over generic BERT on tasks involving biomedical domain. Similarly, we can pre-train the BERT model on text of any domain, which is required by the business usecase.
73 |
74 | Advantages
75 |
76 | · BERT is a highly effective natural language processing model that has achieved state-of-the-art results on a wide range of tasks.
77 |
78 | · BERT uses a unique “transformer” architecture that enables it to better understand the context and meaning of words and phrases in a sentence.
79 |
80 | · BERT can be fine-tuned on specific tasks and datasets, which allows it to be tailored to specific applications and achieve even higher levels of accuracy.
81 |
82 | · BERT is open source and widely available, making it accessible to researchers and developers around the world.
83 |
84 | Limitations
85 |
86 | · BERT requires significant computational resources to pre-train and relatively significant resources to fine-tune, which can be a barrier to entry for smaller research groups or individuals.
87 |
88 | · BERT is trained on large amounts of text data, which can make it difficult to apply to domains or languages with limited data available.
89 |
90 | · BERT can sometimes struggle with understanding context that is not explicitly stated in the text, such as background knowledge or cultural references.
91 |
92 | · BERT is a language model, and as such, it may struggle with tasks that require more than just language understanding, such as tasks that involve visual or audio information.
93 |
94 | Applications
95 |
96 | One of the applications of BERT is extractive question answering. BERT can be fine-tuned on a dataset of question-answer pairs, to enable it to accurately answer questions posed in natural language. Along with these pairs, a passage is provided as a reference, from which the answer is extracted for the given question.
97 |
98 | A BERT model fine-tuned on a question answering dataset can be used to answer such factual questions, providing the correct answer based on the context of the question. This has many potential real-world applications, such as customer service chatbots or virtual assistants that can provide users with accurate and helpful responses to their questions.
99 |
100 | ignio leverages pre-trained transformers for usecases of different domains. Example usecases include: IT security domain to automatically fill up security surveys, legal domain to analyze contracts and NDAs to automatically flag acceptable and unacceptable clauses, extracting information from data sources to capture different aspects of enterprise context, and mapping trouble tickets to ignio’s automation catalog to identify tickets that can be auto-resolved by ignio.
101 |
102 |
103 | About Pushpam Punjabi, the author
104 |
105 | Pushpam Punjabi is a Machine Learning Engineer who develops solutions for the use cases emerging in the field of Natural Language Processing (NLP)/Natural Language Understanding (NLU). He enjoys learning the inner workings of any algorithm and how to implement it effectively to solve any of the posed problems.
106 |
107 |
--------------------------------------------------------------------------------