├── .gitignore ├── encoder-overlap-for-llm.png ├── latency-vs-tensor-parallelism.png ├── average-faiss-distance-per-llm-encoder-combo.png ├── cold-start-expansion-llm-size-vs-encoder-size.png ├── cost-efficiency-diminishing-returns-beyond-tp-16.png ├── nxd_vllm_1b.yaml ├── nxd_vllm_3b.yaml ├── nxd_vllm_8b.yaml ├── nxd_vllm_70b.yaml ├── CODE_OF_CONDUCT.md ├── download_hf_model.py ├── nxd_vllm_11b.yaml ├── upload_hf_model.py ├── cell_load_books.py ├── LICENSE ├── compile_llm_and_encoders.sh ├── cell_compile_vllm.py ├── cell_st_embeddings.py ├── cell_compile_t5.py ├── cell_t5_embeddings.py ├── CONTRIBUTING.md ├── README.md ├── cell_expand_interest_llm.py ├── expand_interest_generate_faiss_index.sh ├── mllama-offline.py ├── t5-offline.py └── study.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | cdk.out 3 | -------------------------------------------------------------------------------- /encoder-overlap-for-llm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/coldstart-recs-on-aws-trainium/HEAD/encoder-overlap-for-llm.png -------------------------------------------------------------------------------- /latency-vs-tensor-parallelism.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/coldstart-recs-on-aws-trainium/HEAD/latency-vs-tensor-parallelism.png -------------------------------------------------------------------------------- /average-faiss-distance-per-llm-encoder-combo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/coldstart-recs-on-aws-trainium/HEAD/average-faiss-distance-per-llm-encoder-combo.png -------------------------------------------------------------------------------- /cold-start-expansion-llm-size-vs-encoder-size.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/coldstart-recs-on-aws-trainium/HEAD/cold-start-expansion-llm-size-vs-encoder-size.png -------------------------------------------------------------------------------- /cost-efficiency-diminishing-returns-beyond-tp-16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/coldstart-recs-on-aws-trainium/HEAD/cost-efficiency-diminishing-returns-beyond-tp-16.png -------------------------------------------------------------------------------- /nxd_vllm_1b.yaml: -------------------------------------------------------------------------------- 1 | model: "meta-llama/Llama-3.2-1B" 2 | tensor_parallel_size: 2 3 | max_num_seqs: 1 4 | max_model_len: 2048 5 | override_neuron_config: 6 | skip_warmup: true 7 | device: "neuron" 8 | -------------------------------------------------------------------------------- /nxd_vllm_3b.yaml: -------------------------------------------------------------------------------- 1 | model: "meta-llama/Llama-3.2-3B" 2 | tensor_parallel_size: 2 3 | max_num_seqs: 1 4 | max_model_len: 2048 5 | override_neuron_config: 6 | skip_warmup: true 7 | device: "neuron" 8 | -------------------------------------------------------------------------------- /nxd_vllm_8b.yaml: -------------------------------------------------------------------------------- 1 | model: "meta-llama/Llama-3.1-8B-Instruct" 2 | tensor_parallel_size: 8 3 | max_num_seqs: 1 4 | 
max_model_len: 2048 5 | override_neuron_config: 6 | skip_warmup: true 7 | device: "neuron" 8 | -------------------------------------------------------------------------------- /nxd_vllm_70b.yaml: -------------------------------------------------------------------------------- 1 | model: "deepseek-ai/DeepSeek-R1-Distill-Llama-70B" 2 | tensor_parallel_size: 32 3 | max_num_seqs: 1 4 | max_model_len: 2048 5 | override_neuron_config: 6 | skip_warmup: true 7 | device: "neuron" 8 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /download_hf_model.py: -------------------------------------------------------------------------------- 1 | from huggingface_hub import login,snapshot_download 2 | import os 3 | hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() 4 | repo_id=os.environ['COMPILED_MODEL_ID'] 5 | repo_dir=os.environ['NEURON_COMPILED_ARTIFACTS'] 6 | login(hf_token, add_to_git_credential=True) 7 | snapshot_download(repo_id=repo_id,local_dir=repo_dir,token=hf_token) 8 | print(f"Repository '{repo_id}' downloaded to '{repo_dir}'.") 9 | 10 | -------------------------------------------------------------------------------- /nxd_vllm_11b.yaml: -------------------------------------------------------------------------------- 1 | model: "yahavb/Llama-3.2-11B-Vision-Instruct-neuron-checkpoint" 2 | tensor_parallel_size: 32 3 | max_num_seqs: 1 4 | #block_size: 4096 5 | max_model_len: 128000 6 | override_neuron_config: 7 | skip_warmup: true 8 | context_encoding_buckets: [43000] 9 | token_generation_buckets: [43000] 10 | sequence_parallel_enabled: False 11 | is_continuous_batching: True 12 | on_device_sampling_config: 13 | global_topk: 64 14 | dynamic: True 15 | deterministic: False 16 | device: "neuron" 17 | -------------------------------------------------------------------------------- /upload_hf_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | from huggingface_hub import create_repo,upload_folder,login 3 | 4 | hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() 5 | repo_id=os.environ['MODEL_ID'] 6 | login(hf_token, add_to_git_credential=True) 7 | 8 | def push_compiled_model_to_hf( 9 | local_dir: str, 10 | repo_id: str, 11 | commit_message: str, 12 | token: str = None, 13 | ): 14 | create_repo( 15 | repo_id=repo_id, 16 | token=token, 17 | exist_ok=True, 18 | private=False 19 | ) 20 | 21 | upload_folder( 22 | folder_path=local_dir, 23 | path_in_repo="", 24 | repo_id=repo_id, 25 | commit_message=commit_message 26 | ) 27 | 28 | 29 | push_compiled_model_to_hf( 30 | local_dir=repo_id, 31 | repo_id=repo_id, 32 | commit_message=f"Add NxD compiled model {repo_id} for vLLM; after converting checkpoints" 33 | ) 34 | 35 | -------------------------------------------------------------------------------- /cell_load_books.py: -------------------------------------------------------------------------------- 1 | # ========================= 2 | # Load Dataset 3 | # ========================= 4 | import pandas as pd 5 | import os 6 | 
import kagglehub 7 | import shutil 8 | 9 | books_df_dataset=os.environ['BOOKS_DF_DS'] 10 | nrows=os.environ['NROWS'] 11 | 12 | local_dir="./data" 13 | local_file=os.path.join(local_dir,"Books_rating.csv") 14 | if os.path.exists(local_file): 15 | print("Books_rating.csv already exists locally. Skipping download.") 16 | else: 17 | dataset_path = kagglehub.dataset_download("mohamedbakhet/amazon-books-reviews") 18 | print("Path to dataset files:",dataset_path) 19 | kaggle_file=os.path.join(dataset_path,"Books_rating.csv") 20 | os.makedirs(local_dir,exist_ok=True) 21 | shutil.copy(kaggle_file,local_file) 22 | print(f"Copied Books_rating.csv to {local_file}") 23 | 24 | df = pd.read_csv(local_file,nrows=int(nrows)) 25 | books_df = df[['Id', 'Title', 'User_id', 'review/helpfulness', 'review/score', 'review/time', 'review/summary', 'review/text']] 26 | books_df.to_pickle(books_df_dataset) 27 | print(f"✅ Books dataset loaded & saved in {books_df_dataset}.") 28 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) Facebook, Inc. and its affiliates. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /compile_llm_and_encoders.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | set -eu 3 | 4 | if [ -z "${HUGGINGFACE_TOKEN:-}" ]; then 5 | echo "HUGGINGFACE_TOKEN is not set." 6 | read -p "Please enter your Hugging Face token: " HUGGINGFACE_TOKEN 7 | export HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN 8 | echo 9 | if [ -z "$HUGGINGFACE_TOKEN" ]; then 10 | echo "Error: no Hugging Face token provided. Exiting." 
>&2 11 | exit 1 12 | fi 13 | fi 14 | 15 | cat nxd_vllm_1b.yaml \ 16 | && export COMPILED_MODEL_ID="yahavb/nxd_vllm_1b" \ 17 | && python cell_compile_vllm.py nxd_vllm_1b.yaml > nxd_vllm_1b.log 18 | 19 | cat nxd_vllm_3b.yaml \ 20 | && export COMPILED_MODEL_ID="yahavb/nxd_vllm_3b" \ 21 | && python cell_compile_vllm.py nxd_vllm_3b.yaml > nxd_vllm_3b.log 22 | 23 | cat nxd_vllm_8b.yaml \ 24 | && export COMPILED_MODEL_ID="yahavb/nxd_vllm_8b" \ 25 | && python cell_compile_vllm.py nxd_vllm_8b.yaml > nxd_vllm_8b.log 26 | 27 | cat nxd_vllm_70b.yaml \ 28 | && export COMPILED_MODEL_ID="yahavb/nxd_vllm_70b" \ 29 | && python cell_compile_vllm.py nxd_vllm_70b.yaml > nxd_vllm_70b.log 30 | 31 | export COMPILED_MODEL_ID="yahavb/t5-v1_1-base" && export MODEL_ID="google/t5-v1_1-base" && export MAX_SEQ_LEN=1024 && export TP_DEGREE=8 && python cell_compile_t5.py 32 | 33 | export COMPILED_MODEL_ID="yahavb/t5-v1_1-large" && export MODEL_ID="google/t5-v1_1-large" && export MAX_SEQ_LEN=1024 && export TP_DEGREE=8 && python cell_compile_t5.py 34 | 35 | export COMPILED_MODEL_ID="yahavb/t5-v1_1-xl" && export MODEL_ID="google/t5-v1_1-xl" && export MAX_SEQ_LEN=1024 && export TP_DEGREE=8 && python cell_compile_t5.py 36 | -------------------------------------------------------------------------------- /cell_compile_vllm.py: -------------------------------------------------------------------------------- 1 | from vllm import LLM, SamplingParams 2 | import yaml 3 | import os 4 | import sys 5 | from huggingface_hub import create_repo,upload_folder,login 6 | 7 | hf_token = os.environ['HUGGINGFACE_TOKEN'].strip() 8 | compiled_model_id=os.environ['COMPILED_MODEL_ID'] 9 | os.environ['NEURON_COMPILED_ARTIFACTS']=compiled_model_id 10 | os.environ['VLLM_NEURON_FRAMEWORK']='neuronx-distributed-inference' 11 | 12 | if len(sys.argv) <= 1: 13 | print("Error: Please provide a path to a vLLM YAML configuration file.") 14 | sys.exit(1) 15 | 16 | config_path = sys.argv[1] 17 | with open(config_path, 'r') as f: 18 | model_vllm_config_yaml = f.read() 19 | 20 | login(hf_token,add_to_git_credential=True) 21 | 22 | def push_compiled_model_to_hf(local_dir,repo_id,commit_message): 23 | create_repo(repo_id=repo_id,exist_ok=True,private=False) 24 | upload_folder(folder_path=local_dir,path_in_repo="",repo_id=repo_id,commit_message=commit_message) 25 | 26 | model_vllm_config = yaml.safe_load(model_vllm_config_yaml) 27 | llm_model = LLM(**model_vllm_config) 28 | sampling_params = SamplingParams(top_k=1, temperature=1.0, max_tokens=64) 29 | prompt = "What is Annapurna Labs?" 
30 | print(f"Running inference with prompt: '{prompt}'") 31 | outputs = llm_model.generate([prompt], sampling_params) 32 | for output in outputs: 33 | print("Prompt:", output.prompt) 34 | print("Generated text:", output.outputs[0].text) 35 | 36 | push_compiled_model_to_hf( 37 | local_dir=compiled_model_id, 38 | repo_id=compiled_model_id, 39 | commit_message=f"Add NxD compiled model {compiled_model_id} for vLLM") 40 | 41 | print(f"✅ compilation was successful and stored in {compiled_model_id}!") 42 | -------------------------------------------------------------------------------- /cell_st_embeddings.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import numpy as np 4 | import faiss 5 | from sentence_transformers import SentenceTransformer 6 | 7 | books_df_dataset_expanded_interest_name=os.environ['BOOKS_DF_DS_EXP_INTEREST'] 8 | books_faiss_index=os.environ['BOOKS_DF_FAISS_IDX'] 9 | 10 | books_df_dataset_expanded_interest = pd.read_pickle(books_df_dataset_expanded_interest_name) 11 | 12 | print(f"Loaded dataset path: {os.environ['BOOKS_DF_DS_EXP_INTEREST']}") 13 | 14 | 15 | # Load SentenceTransformer model 16 | st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") 17 | 18 | def get_st_embedding(text): 19 | return st_model.encode(text).astype("float32") 20 | 21 | # Compute embeddings 22 | books_df_dataset_expanded_interest["st_embedding"] = books_df_dataset_expanded_interest["expanded_interest"].apply(lambda x: get_st_embedding(x).tolist()) 23 | 24 | books_df_dataset_expanded_interest.to_pickle(books_df_dataset_expanded_interest_name) 25 | print(f"✅ Updated .pkl with 'st_embedding' column: {books_df_dataset_expanded_interest_name}") 26 | 27 | # Convert to NumPy matrix 28 | st_matrix = np.array(books_df_dataset_expanded_interest["st_embedding"].tolist()).astype("float32") 29 | #np.save("st_embeddings.npy", st_matrix) 30 | 31 | # Create FAISS index 32 | faiss.normalize_L2(st_matrix) 33 | index_st = faiss.IndexFlatL2(st_matrix.shape[1]) 34 | index_st.add(st_matrix) 35 | 36 | faiss.write_index(index_st,books_faiss_index) 37 | 38 | print(f"✅ SentenceTransformer embeddings computed and saved in {books_faiss_index}") 39 | xb = np.zeros((5, index_st.d), dtype=np.float32) 40 | index_st.reconstruct_n(0, 5, xb) 41 | print("✅ First 5 stored vectors (embeddings):") 42 | print(xb) 43 | -------------------------------------------------------------------------------- /cell_compile_t5.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from transformers import T5Tokenizer, T5EncoderModel 3 | from neuronx_distributed.trace import parallel_model_trace, parallel_model_save, parallel_model_load 4 | from pathlib import Path 5 | import torch.multiprocessing as mp 6 | from huggingface_hub import login 7 | import time 8 | import os 9 | import shutil 10 | from huggingface_hub.hf_api import HfFolder 11 | from huggingface_hub import login,snapshot_download,HfApi 12 | 13 | hf_token = os.environ['HUGGINGFACE_TOKEN'].strip() 14 | model_id = os.environ['MODEL_ID'] 15 | compiled_model_id = os.environ['COMPILED_MODEL_ID'] 16 | max_sequence_length = int(os.environ['MAX_SEQ_LEN']) 17 | tp_degree = int(os.environ['TP_DEGREE']) 18 | 19 | def forward_wrapper(): 20 | model = T5EncoderModel.from_pretrained(model_id, torch_dtype=torch.bfloat16) 21 | return model, {} 22 | 23 | 24 | if __name__ == '__main__': 25 | 26 | login(hf_token, add_to_git_credential=True) 27 | 28 | 
mp.set_start_method("spawn", force=True) 29 | 30 | prompt = "This is a test input for compilation." 31 | compiled_model_path = Path(compiled_model_id) 32 | 33 | model = T5EncoderModel.from_pretrained(model_id, torch_dtype=torch.bfloat16) 34 | tokenizer = T5Tokenizer.from_pretrained(model_id) 35 | 36 | sample_text = "This is a test input for compilation." 37 | sample_inputs = tokenizer( 38 | sample_text, 39 | return_tensors="pt", 40 | max_length=max_sequence_length, 41 | truncation=True, 42 | padding="max_length" 43 | ) 44 | input_ids = sample_inputs["input_ids"] 45 | attention_mask = sample_inputs["attention_mask"] 46 | 47 | sample_tensors = (input_ids, attention_mask) 48 | 49 | traced_model = parallel_model_trace(forward_wrapper,sample_tensors,tp_degree=tp_degree) 50 | 51 | parallel_model_save(traced_model, compiled_model_id) 52 | print(f"Model compiled successfully! Uploading to {compiled_model_id}") 53 | 54 | api = HfApi() 55 | api.create_repo(repo_id=compiled_model_id, exist_ok=True) 56 | api.upload_folder(folder_path=compiled_model_id, repo_id=compiled_model_id, repo_type="model") 57 | -------------------------------------------------------------------------------- /cell_t5_embeddings.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import faiss 4 | import pandas as pd 5 | import numpy as np 6 | from transformers import T5Tokenizer 7 | from neuronx_distributed.trace import parallel_model_load 8 | from huggingface_hub import snapshot_download 9 | 10 | books_df_dataset_expanded_interest_name=os.environ['BOOKS_DF_DS_EXP_INTEREST'] 11 | books_faiss_index=os.environ['BOOKS_DF_FAISS_IDX'] 12 | 13 | books_df_dataset_expanded_interest = pd.read_pickle(books_df_dataset_expanded_interest_name) 14 | 15 | print(f"Loaded dataset path: {os.environ['BOOKS_DF_DS_EXP_INTEREST']}") 16 | 17 | 18 | model_id=os.environ['MODEL_ID'] 19 | repo_id=os.environ['COMPILED_MODEL_ID'] 20 | max_sequence_length = int(os.environ['MAX_SEQ_LEN']) 21 | local_dir=snapshot_download(repo_id,allow_patterns="tp_*.pt") 22 | 23 | t5_tokenizer = T5Tokenizer.from_pretrained(model_id) 24 | embedding_t5_model = parallel_model_load(local_dir) 25 | 26 | def get_t5_embedding(text): 27 | """ 28 | Create T5-based embeddings by extracting encoder hidden states. 29 | Ensures inputs are always padded/truncated to the fixed 512 token size. 
30 | """ 31 | #print(f"Encoding text: {text[:100]}...") 32 | inputs = t5_tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=max_sequence_length) 33 | 34 | with torch.no_grad(): 35 | output = embedding_t5_model(inputs["input_ids"], inputs["attention_mask"]) 36 | 37 | if isinstance(output, dict): 38 | last_hidden_state = output["last_hidden_state"] # Extract correct tensor 39 | else: 40 | last_hidden_state = output # Fallback if output isn't a dict (rare case) 41 | 42 | embedding = last_hidden_state.mean(dim=1).squeeze().to(torch.float32).cpu().numpy() 43 | #print(f"Generated embedding (first 5 dims): {embedding[:5]}") 44 | 45 | return embedding 46 | 47 | 48 | # Generate T5 Embeddings 49 | books_df_dataset_expanded_interest["t5_embedding"] = books_df_dataset_expanded_interest["expanded_interest"].apply(lambda x: get_t5_embedding(x).tolist()) 50 | 51 | books_df_dataset_expanded_interest.to_pickle(books_df_dataset_expanded_interest_name) 52 | print(f"✅ Updated .pkl with 't5_embedding' column: {books_df_dataset_expanded_interest_name}") 53 | 54 | # Create FAISS Index for T5 55 | t5_matrix = np.array(books_df_dataset_expanded_interest["t5_embedding"].tolist()).astype("float32") 56 | faiss.normalize_L2(t5_matrix) 57 | index_t5 = faiss.IndexFlatL2(t5_matrix.shape[1]) 58 | index_t5.add(t5_matrix) 59 | 60 | # Save to disk 61 | #np.save("t5_matrix.npy", t5_matrix) 62 | faiss.write_index(index_t5,books_faiss_index) 63 | print(f"✅ T5 Embeddings & FAISS index saved in {books_faiss_index}") 64 | xb = np.zeros((5, index_t5.d), dtype=np.float32) 65 | index_t5.reconstruct_n(0, 5, xb) 66 | print("✅ First 5 stored vectors (embeddings):") 67 | print(xb) 68 | 69 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. 
Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Cold-Start Recommendations with vLLM and AWS Trainium 2 | 3 | This repo demonstrates how to solve the **cold-start problem in recommendation systems** using **large language models (LLMs)** on **AWS Trainium (Trn1)** with **vLLM**, **Neuron SDK**, and **FAISS** for semantic retrieval. It features multi-LLM comparison (DeepSeek LLaMA 8B vs. 70B), structured prompting for interest expansion, and high-performance inference using **NeuronX Distributed (NxD)**. 4 | 5 | ## 🚀 About 6 | 7 | End-to-end solution for cold-start recommendations using **vLLM**, **DeepSeek LLaMA (8B & 70B)**, and **FAISS** on **AWS Trainium (Trn1)** with the **Neuron SDK** and **NeuronX Distributed**. Includes LLM-based interest expansion, embedding comparisons (T5 & SentenceTransformers), and scalable retrieval workflows. 
8 | 
9 | ## 🛠 Tech Stack
10 | 
11 | - **Inference**: [vLLM](https://github.com/vllm-project/vllm), DeepSeek LLaMA (8B/70B)
12 | - **Hardware**: AWS EC2 Trn1 (Trainium), Neuron SDK, NeuronX Distributed (NxD)
13 | - **Embeddings**: SentenceTransformers, T5 Encoder
14 | - **Retrieval**: FAISS (Facebook AI Similarity Search)
15 | - **Frameworks**: PyTorch, Hugging Face Transformers
16 | - **Data**: Amazon Books (from the Amazon Reviews dataset)
17 | 
18 | ## 📦 Project Structure
19 | 
20 | - **notebooks/**
21 |   - **BookExpanSim.ipynb**
22 |     Uses vLLM to generate expanded user interests from minimal input.
23 |     Converts interests and content into embeddings and builds FAISS indices.
24 |     Retrieves recommendations using FAISS and compares outputs from different LLMs.
25 | 
26 | - **scripts/**
27 |   - **compile_llm_and_encoders.sh**
28 |     Script to compile the `meta-llama/Llama-3.2-1B`, `meta-llama/Llama-3.2-3B`, `meta-llama/Llama-3.1-8B-Instruct` and `deepseek-ai/DeepSeek-R1-Distill-Llama-70B` LLMs using vLLM with NeuronX Distributed on AWS Trainium.
29 |   - **expand_interest_generate_faiss_index.sh**
30 |     Script to expand user interests and build the corresponding FAISS indexes.
31 |   - **cell_*.py**
32 |     Cell scripts that expand user interests and create the similarity indexes.
33 | 
34 | 
35 | ## ⚙️ Quickstart
36 | 
37 | 1. **Launch a Trn1 instance** with AWS Deep Learning Containers (DLC) and Neuron SDK pre-installed.
38 | 2. **Install Python dependencies:**
39 |    ```bash
40 |    pip install --upgrade pip
41 |    pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
42 |    pip install --upgrade neuronx-cc transformers_neuronx neuronx_distributed transformers torch-neuronx accelerate triton protobuf sentence_transformers
43 |    git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
44 |    cd upstreaming-to-vllm
45 |    pip install -r requirements-neuron.txt
46 |    VLLM_TARGET_DEVICE="neuron" pip install -e .
47 |    pip install --upgrade "transformers==4.45.2"
48 |    ```
49 | 
50 |    **Clone the repository:**
51 |    ```bash
52 |    git clone https://github.com/aws-samples/coldstart-recs-on-aws-trainium.git
53 |    cd coldstart-recs-on-aws-trainium
54 |    ```
55 | 
56 | 3. **Run the Jupyter notebook**
57 |    Start Jupyter Notebook to run the interactive examples:
58 |    ```bash
59 |    jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
60 |    ```
61 | 
62 | 4. **Run the model compile script**
63 |    Run the script:
64 |    ```bash
65 |    ./compile_llm_and_encoders.sh > compile_llm_and_encoders.log 2>&1 &
66 |    ```
67 | 
68 | 5. **Run the user interest expansion script**
69 |    Edit `NROWS` to the desired number of rows to include from the Amazon Book Reviews dataset, then run:
70 |    ```bash
71 |    ./expand_interest_generate_faiss_index.sh > expand_interest_generate_faiss_index.log 2>&1 &
72 |    ```
73 | 
74 | 6. Then, execute the scripts and cells in the notebook in this order:
75 |    - **BookExpanSim.ipynb**: Retrieve recommendations and compare results from multiple LLMs (see the retrieval example below).
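For reference, the retrieval step performed in the notebook can be sketched as follows (a minimal illustration rather than the notebook code itself; it assumes the 8B / t5-base artifacts produced by the scripts above, and any other `.pkl`/`.index` pair generated by `expand_interest_generate_faiss_index.sh` works the same way):

```python
import faiss
import numpy as np
import pandas as pd

# Expanded interests and embeddings written by cell_expand_interest_llm.py / cell_t5_embeddings.py
df = pd.read_pickle("expanded_interest_books_nxd_vllm_8b.pkl")
# FAISS index written by cell_t5_embeddings.py
index = faiss.read_index("expanded_interest_books_nxd_vllm_8b_t5_base_faiss.index")

# Use the first stored embedding as a stand-in query vector
query = np.array([df["t5_embedding"].iloc[0]], dtype="float32")
faiss.normalize_L2(query)

distances, neighbors = index.search(query, 5)
print(df.iloc[neighbors[0]][["Title", "expanded_interest"]])
print("Distances:", distances[0])
```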
76 | -------------------------------------------------------------------------------- /cell_expand_interest_llm.py: -------------------------------------------------------------------------------- 1 | from vllm import LLM, SamplingParams 2 | import yaml 3 | import os 4 | import sys 5 | import torch_neuronx 6 | import pandas as pd 7 | from huggingface_hub import create_repo,upload_folder,login,snapshot_download 8 | from tqdm import tqdm 9 | import time 10 | import math 11 | 12 | hf_token = os.environ['HUGGINGFACE_TOKEN'].strip() 13 | books_df_dataset=os.environ['BOOKS_DF_DS'] 14 | books_df_dataset_expanded_interest=os.environ['BOOKS_DF_DS_EXP_INTEREST'] 15 | repo_id=os.environ['MODEL_ID'] 16 | repo_dir=repo_id 17 | os.environ['NEURON_COMPILED_ARTIFACTS']=repo_id 18 | os.environ['VLLM_NEURON_FRAMEWORK']='neuronx-distributed-inference' 19 | 20 | login(hf_token,add_to_git_credential=True) 21 | 22 | 23 | #snapshot_download(repo_id=repo_id,local_dir=repo_dir) 24 | #print(f"Repository '{repo_id}' downloaded to '{repo_dir}'.") 25 | 26 | books_df = pd.read_pickle(books_df_dataset) 27 | print(f"Loaded the dataset {books_df_dataset}") 28 | 29 | if len(sys.argv) <= 1: 30 | print("Error: Please provide a path to a YAML configuration file.") 31 | sys.exit(1) 32 | 33 | config_path = sys.argv[1] 34 | with open(config_path, 'r') as f: 35 | model_vllm_config_yaml = f.read() 36 | 37 | class LatencyCollector: 38 | def __init__(self): 39 | self.latency_list = [] 40 | 41 | def record(self, latency_sec): 42 | self.latency_list.append(latency_sec) 43 | 44 | def percentile(self, percent): 45 | if not self.latency_list: 46 | return 0.0 47 | latency_list = sorted(self.latency_list) 48 | pos_float = len(latency_list) * percent / 100 49 | max_pos = len(latency_list) - 1 50 | pos_floor = min(math.floor(pos_float), max_pos) 51 | pos_ceil = min(math.ceil(pos_float), max_pos) 52 | return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] 53 | 54 | def report(self, test_name="Batch Inference"): 55 | print(f"\n📊 LATENCY REPORT for {test_name}") 56 | for p in [0, 50, 90, 95, 99, 100]: 57 | value = self.percentile(p) * 1000 58 | print(f"Latency P{p}: {value:.2f} ms") 59 | 60 | latency_collector = LatencyCollector() 61 | 62 | model_vllm_config = yaml.safe_load(model_vllm_config_yaml) 63 | llm_model = LLM(**model_vllm_config) 64 | 65 | def expand_interest(user_reviews): 66 | """ 67 | Expands the user's book interest using LLM reasoning. 68 | Ensures the returned type is a valid string for embeddings. 69 | """ 70 | prompt = f"Expand the following book interests: {user_reviews}" 71 | sampling_params = SamplingParams(top_k=64) 72 | request_output = llm_model.generate([prompt],sampling_params=sampling_params)[0] 73 | # Extract text from request_output 74 | if request_output.outputs and hasattr(request_output.outputs[0], 'text'): 75 | expanded_text = request_output.outputs[0].text 76 | else: 77 | expanded_text = str(request_output) 78 | return expanded_text.strip() 79 | 80 | # Test the expand_interest function 81 | test_user_interest = "I love epic sci-fi novels with deep world-building." 
82 | expanded_test_interest = expand_interest(test_user_interest) 83 | print("✅ Expanded Interest Example:", expanded_test_interest) 84 | # Expand interests for all users 85 | # books_df["expanded_interest"] = books_df["review/text"].apply(expand_interest) 86 | 87 | batch_size = 8 # Adjust based on memory availability 88 | MAX_PROMPT_LEN = 512 89 | 90 | expanded_interests = [] 91 | 92 | # Convert reviews to list for slicing 93 | all_reviews = books_df["review/text"].tolist() 94 | 95 | # Process in batches 96 | for i in tqdm(range(0, len(all_reviews), batch_size), desc="Expanding interests"): 97 | batch = all_reviews[i:i + batch_size] 98 | prompts = [f"Expand the following book interests: {review[:MAX_PROMPT_LEN]}" for review in batch] 99 | 100 | try: 101 | sampling_params = SamplingParams(top_k=64) 102 | print(f"⏳ Batch {i}-{i+len(batch)}: sending {len(prompts)} prompts") 103 | start_time = time.time() 104 | results = llm_model.generate(prompts, sampling_params=sampling_params) 105 | latency_sec = time.time() - start_time 106 | latency_collector.record(latency_sec) 107 | print(f"✅ Batch {i}-{i+len(batch)} complete in {latency_sec:.2f}s") 108 | 109 | for output in results: 110 | if output.outputs and hasattr(output.outputs[0], 'text'): 111 | expanded_text = output.outputs[0].text.strip() 112 | else: 113 | expanded_text = str(output) 114 | expanded_interests.append(expanded_text) 115 | 116 | except Exception as e: 117 | print(f"⚠️ Error during batch {i}-{i+batch_size}: {e}") 118 | expanded_interests.extend([""] * len(batch)) 119 | with open("skipped_batches.txt", "a") as logf: 120 | logf.write(f"\n--- Skipped batch {i}-{i+len(batch)} ---\n") 121 | for review in batch: 122 | logf.write(review + "\n") 123 | # maintain DataFrame alignment 124 | # expanded_interests.extend([""] * len(batch)) 125 | 126 | # Assign the expanded results to DataFrame 127 | books_df["expanded_interest"] = expanded_interests 128 | 129 | books_df.to_pickle(books_df_dataset_expanded_interest) 130 | print(f"✅ Expanded Interests Generated in {books_df_dataset_expanded_interest}!") 131 | latency_collector.report("User Interest Expansion") 132 | -------------------------------------------------------------------------------- /expand_interest_generate_faiss_index.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | set -eu 3 | 4 | if [ -z "${HUGGINGFACE_TOKEN:-}" ]; then 5 | echo "HUGGINGFACE_TOKEN is not set." 6 | read -p "Please enter your Hugging Face token: " HUGGINGFACE_TOKEN 7 | export HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN 8 | echo 9 | if [ -z "$HUGGINGFACE_TOKEN" ]; then 10 | echo "Error: no Hugging Face token provided. Exiting." 
>&2 11 | exit 1 12 | fi 13 | fi 14 | 15 | export BOOKS_DF_DS="books_df.pkl"; export NROWS="100" \ 16 | && python cell_load_books.py 17 | 18 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_1b.pkl" \ 19 | && export BOOKS_DF_DS="books_df.pkl" \ 20 | && export MODEL_ID="yahavb/nxd_vllm_1b" \ 21 | && python cell_expand_interest_llm.py nxd_vllm_1b.yaml > expanded_interest_books_nxd_vllm_1b.log 22 | 23 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_3b.pkl" \ 24 | && export BOOKS_DF_DS="books_df.pkl" \ 25 | && export MODEL_ID="yahavb/nxd_vllm_3b" \ 26 | && python cell_expand_interest_llm.py nxd_vllm_3b.yaml > expanded_interest_books_nxd_vllm_3b.log 27 | 28 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_8b.pkl" \ 29 | && export BOOKS_DF_DS="books_df.pkl" \ 30 | && export MODEL_ID="yahavb/nxd_vllm_8b" \ 31 | && python cell_expand_interest_llm.py nxd_vllm_8b.yaml > expanded_interest_books_nxd_vllm_8b.log 32 | 33 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_70b.pkl" \ 34 | && export BOOKS_DF_DS="books_df.pkl" \ 35 | && export MODEL_ID="yahavb/nxd_vllm_70b" \ 36 | && python cell_expand_interest_llm.py nxd_vllm_70b.yaml > expanded_interest_books_nxd_vllm_70b.log 37 | 38 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_1b.pkl" \ 39 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_1b_t5_base_faiss.index" \ 40 | && export MODEL_ID="google/t5-v1_1-base" \ 41 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-base" \ 42 | && export MAX_SEQ_LEN=1024 \ 43 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_1b_t5_base_faiss.log 44 | 45 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_1b.pkl" \ 46 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_1b_t5_large_faiss.index" \ 47 | && export MODEL_ID="google/t5-v1_1-base" \ 48 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-large" \ 49 | && export MAX_SEQ_LEN=1024 \ 50 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_1b_t5_large_faiss.log 51 | 52 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_1b.pkl" \ 53 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_1b_t5_xl_faiss.index" \ 54 | && export MODEL_ID="google/t5-v1_1-xl" \ 55 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-xl" \ 56 | && export MAX_SEQ_LEN=1024 \ 57 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_1b_t5_xl_faiss.log 58 | 59 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_3b.pkl" \ 60 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_3b_t5_base_faiss.index" \ 61 | && export MODEL_ID="google/t5-v1_1-base" \ 62 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-base" \ 63 | && export MAX_SEQ_LEN=1024 \ 64 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_3b_t5_base_faiss.log 65 | 66 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_3b.pkl" \ 67 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_3b_t5_large_faiss.index" \ 68 | && export MODEL_ID="google/t5-v1_1-base" \ 69 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-large" \ 70 | && export MAX_SEQ_LEN=1024 \ 71 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_3b_t5_large_faiss.log 72 | 73 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_3b.pkl" \ 74 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_3b_t5_xl_faiss.index" \ 75 | && export MODEL_ID="google/t5-v1_1-xl" \ 76 | && export 
COMPILED_MODEL_ID="yahavb/t5-v1_1-xl" \ 77 | && export MAX_SEQ_LEN=1024 \ 78 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_3b_t5_xl_faiss.log 79 | 80 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_8b.pkl" \ 81 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_8b_t5_base_faiss.index" \ 82 | && export MODEL_ID="google/t5-v1_1-base" \ 83 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-base" \ 84 | && export MAX_SEQ_LEN=1024 \ 85 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_8b_t5_base_faiss.log 86 | 87 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_8b.pkl" \ 88 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_8b_t5_large_faiss.index" \ 89 | && export MODEL_ID="google/t5-v1_1-base" \ 90 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-large" \ 91 | && export MAX_SEQ_LEN=1024 \ 92 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_8b_t5_large_faiss.log 93 | 94 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_8b.pkl" \ 95 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_8b_t5_xl_faiss.index" \ 96 | && export MODEL_ID="google/t5-v1_1-xl" \ 97 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-xl" \ 98 | && export MAX_SEQ_LEN=1024 \ 99 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_8b_t5_xl_faiss.log 100 | 101 | 102 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_70b.pkl" \ 103 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_70b_t5_base_faiss.index" \ 104 | && export MODEL_ID="google/t5-v1_1-base" \ 105 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-base" \ 106 | && export MAX_SEQ_LEN=1024 \ 107 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_70b_t5_base_faiss.log 108 | 109 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_70b.pkl" \ 110 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_70b_t5_large_faiss.index" \ 111 | && export MODEL_ID="google/t5-v1_1-base" \ 112 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-large" \ 113 | && export MAX_SEQ_LEN=1024 \ 114 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_70b_t5_large_faiss.log 115 | 116 | export BOOKS_DF_DS_EXP_INTEREST="expanded_interest_books_nxd_vllm_70b.pkl" \ 117 | && export BOOKS_DF_FAISS_IDX="expanded_interest_books_nxd_vllm_70b_t5_xl_faiss.index" \ 118 | && export MODEL_ID="google/t5-v1_1-xl" \ 119 | && export COMPILED_MODEL_ID="yahavb/t5-v1_1-xl" \ 120 | && export MAX_SEQ_LEN=1024 \ 121 | && python cell_t5_embeddings.py > expanded_interest_books_nxd_vllm_70b_t5_xl_faiss.log 122 | -------------------------------------------------------------------------------- /mllama-offline.py: -------------------------------------------------------------------------------- 1 | import math 2 | import time 3 | import torch 4 | import os 5 | import sys 6 | import yaml 7 | import requests 8 | from PIL import Image 9 | from vllm import LLM, SamplingParams, TextPrompt 10 | from neuronx_distributed_inference.models.mllama.utils import add_instruct 11 | from huggingface_hub import create_repo,upload_folder,login,snapshot_download 12 | from transformers import AutoTokenizer 13 | 14 | hf_token = os.environ['HUGGINGFACE_TOKEN'].strip() 15 | model_id=os.environ['MODEL_ID'] 16 | os.environ['NEURON_COMPILED_ARTIFACTS']=model_id 17 | os.environ['VLLM_NEURON_FRAMEWORK']='neuronx-distributed-inference' 18 | login(hf_token,add_to_git_credential=True) 19 | 20 | tokenizer = 
AutoTokenizer.from_pretrained(model_id, use_fast=True) 21 | 22 | if len(sys.argv) <= 1: 23 | print("Error: Please provide a path to a YAML configuration file.") 24 | sys.exit(1) 25 | 26 | config_path = sys.argv[1] 27 | with open(config_path, 'r') as f: 28 | model_vllm_config_yaml = f.read() 29 | 30 | model_vllm_config = yaml.safe_load(model_vllm_config_yaml) 31 | 32 | class LatencyCollector: 33 | def __init__(self): 34 | self.latency_list = [] 35 | self.rps_list= [] 36 | self.in_tokens_list= [] 37 | self.out_tokens_list= [] 38 | 39 | 40 | def record(self, latency_sec, rps=None, in_tokens=None, out_tokens=None): 41 | self.latency_list.append(latency_sec) 42 | if rps is not None: self.rps_list.append(rps) 43 | if in_tokens is not None: self.in_tokens_list.append(in_tokens) 44 | if out_tokens is not None: self.out_tokens_list.append(out_tokens) 45 | 46 | 47 | def percentile(self, percent): 48 | if not self.latency_list: 49 | return 0.0 50 | latency_list = sorted(self.latency_list) 51 | pos_float = len(latency_list) * percent / 100 52 | max_pos = len(latency_list) - 1 53 | pos_floor = min(math.floor(pos_float), max_pos) 54 | pos_ceil = min(math.ceil(pos_float), max_pos) 55 | return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] 56 | 57 | def report(self, test_name="Batch Inference"): 58 | print(f"\n📊 TEST REPORT for {test_name}") 59 | total = len(self.latency_list) 60 | for p in [0, 50, 90, 95, 99, 100]: 61 | value = self.percentile(p) * 1000 62 | print(f"Latency P{p}: {value:.2f} ms") 63 | if self.rps_list: 64 | avg_rps = sum(self.rps_list)/total 65 | print(f"⏱️ Requests/sec avg: {avg_rps:.2f}, min: {min(self.rps_list):.2f}, max: {max(self.rps_list):.2f}") 66 | if self.in_tokens_list: 67 | avg_in = sum(self.in_tokens_list)/total 68 | print(f"🔤 Input tokens avg: {avg_in:.1f}, min: {min(self.in_tokens_list)}, max: {max(self.in_tokens_list)}") 69 | if self.out_tokens_list: 70 | avg_out = sum(self.out_tokens_list)/total 71 | print(f"🔡 Output tokens avg: {avg_out:.1f}, min: {min(self.out_tokens_list)}, max: {max(self.out_tokens_list)}") 72 | print(f"🔢 Total executions: {total}") 73 | 74 | def get_image(image_url): 75 | image = Image.open(requests.get(image_url, stream=True).raw) 76 | return image 77 | 78 | # Model Inputs 79 | PROMPTS = ["What is in this image? Tell me a story", 80 | "What is the recipe of mayonnaise in two sentences?" , 81 | "Describe this image", 82 | "What is the capital of Italy famous for?", 83 | ] 84 | IMAGES = [get_image("https://github.com/meta-llama/llama-models/blob/main/models/resources/dog.jpg?raw=true"), 85 | torch.empty((0,0)), 86 | get_image("https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_images/nxd-inference-block-diagram.jpg"), 87 | torch.empty((0,0)), 88 | ] 89 | SAMPLING_PARAMS = [dict(top_k=1, temperature=1.0, top_p=1.0, max_tokens=256), 90 | dict(top_k=1, temperature=0.9, top_p=1.0, max_tokens=256), 91 | dict(top_k=10, temperature=0.9, top_p=0.5, max_tokens=512), 92 | dict(top_k=10, temperature=0.75, top_p=0.5, max_tokens=1024), 93 | ] 94 | 95 | 96 | def get_VLLM_mllama_model_inputs(prompt, single_image, sampling_params): 97 | input_image = single_image 98 | has_image = torch.tensor([1]) 99 | if isinstance(single_image, torch.Tensor) and single_image.numel() == 0: 100 | has_image = torch.tensor([0]) 101 | 102 | instruct_prompt = add_instruct(prompt, has_image) 103 | inputs = TextPrompt(prompt=instruct_prompt) 104 | inputs["multi_modal_data"] = {"image": input_image} 105 | # Create a sampling params object. 
106 | sampling_params = SamplingParams(**sampling_params) 107 | return inputs, sampling_params 108 | 109 | def warmup_model(model, calls: int = 5,collector=None): 110 | """ 111 | Run a few dummy inferences over all prompt/image pairs 112 | to compile kernels and fill caches before measuring. 113 | """ 114 | print(f"🔄 Warming up model with {calls} full passes…") 115 | for _ in range(calls): 116 | for pmpt, img, params in zip(PROMPTS, IMAGES, SAMPLING_PARAMS): 117 | inp, sp = get_VLLM_mllama_model_inputs(pmpt, img, params) 118 | _ = model.generate(inp, sp) 119 | print("✅ Warm-up complete.\n") 120 | 121 | 122 | def print_outputs(outputs): 123 | # Print the outputs. 124 | for output in outputs: 125 | prompt = output.prompt 126 | generated_text = output.outputs[0].text 127 | print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") 128 | 129 | 130 | llm_model = LLM(**model_vllm_config) 131 | 132 | assert len(PROMPTS) == len(IMAGES) == len(SAMPLING_PARAMS), \ 133 | f"""Text, image prompts and sampling parameters should have the same batch size, 134 | got {len(PROMPTS)}, {len(IMAGES)}, and {len(SAMPLING_PARAMS)}""" 135 | 136 | warmup_model(llm_model, calls=3) 137 | latency_collector = LatencyCollector() 138 | tokenizer= AutoTokenizer.from_pretrained(model_id, use_fast=True) 139 | in_tokens_list = [] 140 | out_tokens_list = [] 141 | rps_list = [] 142 | 143 | for i in range(1,5): 144 | for pmpt, img, params in zip(PROMPTS, IMAGES, SAMPLING_PARAMS): 145 | inputs, sampling_params = get_VLLM_mllama_model_inputs(pmpt, img, params) 146 | start_time = time.time() 147 | outputs = llm_model.generate(inputs, sampling_params) 148 | latency_sec = time.time() - start_time 149 | 150 | rps = 1.0/latency_sec if latency_sec>0 else 0.0 151 | in_count = len(tokenizer(pmpt, add_special_tokens=False)["input_ids"]) 152 | if isinstance(img, Image.Image): 153 | patch_size = 16 154 | w, h = img.size 155 | num_patches = (h // patch_size) * (w // patch_size) 156 | in_count += num_patches 157 | out_text = outputs[0].outputs[0].text 158 | out_count = len(tokenizer(out_text, add_special_tokens=False)["input_ids"]) 159 | 160 | latency_collector.record(latency_sec,rps=rps,in_tokens=in_count, out_tokens=out_count) 161 | print_outputs(outputs) 162 | 163 | latency_collector.report(model_id) 164 | -------------------------------------------------------------------------------- /t5-offline.py: -------------------------------------------------------------------------------- 1 | import os, math 2 | import torch 3 | from transformers import T5Tokenizer 4 | from neuronx_distributed.trace import parallel_model_load 5 | from huggingface_hub import snapshot_download 6 | import time 7 | 8 | model_id=os.environ['MODEL_ID'] 9 | repo_id=os.environ['COMPILED_MODEL_ID'] 10 | local_dir=snapshot_download(repo_id,allow_patterns="tp_*.pt") 11 | max_sequence_length = int(os.environ['MAX_SEQ_LEN']) 12 | t5_tokenizer = T5Tokenizer.from_pretrained(model_id) 13 | t5_tokenizer.model_max_length = max_sequence_length 14 | embedding_t5_model = parallel_model_load(local_dir) 15 | 16 | 17 | class LatencyCollector: 18 | def __init__(self): 19 | self.latency_list = [] 20 | self.rps_list= [] 21 | self.in_tokens_list= [] 22 | self.out_tokens_list= [] 23 | 24 | 25 | def record(self, latency_sec, rps=None, in_tokens=None, out_tokens=None): 26 | self.latency_list.append(latency_sec) 27 | if rps is not None: self.rps_list.append(rps) 28 | if in_tokens is not None: self.in_tokens_list.append(in_tokens) 29 | if out_tokens is not None: 
self.out_tokens_list.append(out_tokens) 30 | 31 | 32 | def percentile(self, percent): 33 | if not self.latency_list: 34 | return 0.0 35 | latency_list = sorted(self.latency_list) 36 | pos_float = len(latency_list) * percent / 100 37 | max_pos = len(latency_list) - 1 38 | pos_floor = min(math.floor(pos_float), max_pos) 39 | pos_ceil = min(math.ceil(pos_float), max_pos) 40 | return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] 41 | 42 | def report(self, test_name="Batch Inference"): 43 | print(f"\n📊 TEST REPORT for {test_name}") 44 | total = len(self.latency_list) 45 | for p in [0, 50, 90, 95, 99, 100]: 46 | value = self.percentile(p) * 1000 47 | print(f"Latency P{p}: {value:.2f} ms") 48 | if self.rps_list: 49 | avg_rps = sum(self.rps_list)/total 50 | print(f"⏱️ Requests/sec avg: {avg_rps:.2f}, min: {min(self.rps_list):.2f}, max: {max(self.rps_list):.2f}") 51 | if self.in_tokens_list: 52 | avg_in = sum(self.in_tokens_list)/total 53 | print(f"🔤 Input tokens avg: {avg_in:.1f}, min: {min(self.in_tokens_list)}, max: {max(self.in_tokens_list)}") 54 | if self.out_tokens_list: 55 | avg_out = sum(self.out_tokens_list)/total 56 | print(f"🔡 Output tokens avg: {avg_out:.1f}, min: {min(self.out_tokens_list)}, max: {max(self.out_tokens_list)}") 57 | print(f"🔢 Total executions: {total}") 58 | 59 | def get_t5_embedding(text): 60 | #print(f"Encoding text: {text[:100]}...") 61 | inputs = t5_tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=max_sequence_length) 62 | 63 | with torch.no_grad(): 64 | output = embedding_t5_model(inputs["input_ids"], inputs["attention_mask"]) 65 | 66 | if isinstance(output, dict): 67 | last_hidden_state = output["last_hidden_state"] # Extract correct tensor 68 | else: 69 | last_hidden_state = output # Fallback if output isn't a dict (rare case) 70 | 71 | embedding = last_hidden_state.mean(dim=1).squeeze().to(torch.float32).cpu().numpy() 72 | #print(f"Generated embedding (first 5 dims): {embedding[:5]}") 73 | return embedding 74 | 75 | def warmup_model(model, calls: int = 5,collector=None): 76 | print(f"🔄 Warming up model with {calls} full passes…") 77 | for _ in range(calls): 78 | for pmpt in PROMPTS: 79 | get_t5_embedding(pmpt) 80 | print("✅ Warm-up complete.\n") 81 | 82 | PROMPTS = ["The image is a close-up photograph of a small, fluffy dog with its tongue out. The dog has light-brown fur and dark eyes. Its black nose is prominent, and its tongue is sticking out of its mouth. The dog appears to be a young puppy, and its fur is shaggy and unkempt. The background is blurred, but it appears to be a green outdoor setting. The overall atmosphere of the image is playful and happy, as the dog's tongue out and happy expression suggest that it is enjoying itself.","The image is a blurry photograph of three people walking through a field on a foggy day. Three people are walking through a field of tall, yellow grass. The people are in the center of the image, and they are blurry and indistinct. There are trees in the background, and the sky is foggy and yellow. The atmosphere is peaceful and serene, with the fog adding a sense of mystery to the scene.", 83 | "You\'re likely thinking of Rome, the capital of Italy! Rome is famous for many things, including:\n\n1. **Ancient History and Architecture**: Rome is home to numerous ancient ruins, such as the Colosseum, the Roman Forum, and the Pantheon, which showcase the city\'s rich history and engineering prowess.\n2. 
**Vatican City**: The Vatican, an independent city-state within Rome, is home to the Pope and the central government of the Catholic Church. The Vatican Museums and Sistine Chapel are world-renowned for their art and architecture.\n3. **Food and Wine**: Rome is famous for its delicious cuisine, including dishes like carbonara, amatriciana, and cacio e pepe. The city is also known for its wine production, particularly the Frascati and Castelli Romani wines.\n4. **Art and Culture**: Rome has a vibrant arts scene, with numerous museums, galleries, and festivals throughout the year. The city is also home to the famous Trevi Fountain, Spanish Steps, and Piazza Navona.\n5. **Fashion and Design**: Rome is a hub for fashion and design, with many high-end fashion brands and designers calling the city home.\n6. **History of the Roman Empire**: Rome is the birthplace of the Roman Empire, and the city\'s history is still visible in its architecture, art, and culture.\n7. **Papal History**: Rome has been the center of the Catholic Church for centuries, and the city is home to many important papal landmarks, such as St. Peter\'s Basilica and the Vatican Library.\n8. **Outdoor Spaces**: Rome has many beautiful parks and gardens, such as the Villa Borghese and the Orto Botanico, which offer a peaceful escape from the city\'s bustling streets.\n9. **Nightlife**: Rome has a lively nightlife scene, with many bars, clubs, and live music venues to choose from.\n10. **Film and Media**: Rome has been the setting for many famous films, including La Dolce Vita, Roman Holiday, and Gladiator, and the city is often referred to as the \"Hollywood of Europe.\"\n\nThese are just a few examples of what Rome is famous for. The city has a rich history, culture, and beauty that makes it a must-visit destination for anyone interested in exploring Italy.", 84 | "The image features a small, tan and white dog standing on a skateboard. The dog's short, fluffy fur is predominantly tan, with a white chest and black nose. Its ears are pointed upright, and its tail is curled up and to the left. The dog's front paws are positioned on the board, while its back paws are raised off the ground. The skateboard has red wheels and a light-colored wooden deck. The background of the image is a plain, off-white color, with a subtle shadow cast by the dog and skateboard. The overall atmosphere of the image is playful and fun, with the dog appearing to be enjoying itself on the skateboard." 
85 | ] 86 | 87 | warmup_model(embedding_t5_model,3) 88 | 89 | in_tokens_list = [] 90 | out_tokens_list = [] 91 | rps_list = [] 92 | 93 | latency_collector = LatencyCollector() 94 | 95 | for i in range(1,50): 96 | for pmpt in PROMPTS: 97 | start_time = time.time() 98 | outputs=get_t5_embedding(pmpt) 99 | latency_sec = time.time() - start_time 100 | rps = 1.0/latency_sec if latency_sec>0 else 0.0 101 | in_count = len(t5_tokenizer(pmpt, add_special_tokens=False)["input_ids"]) 102 | #out_text = outputs[0].text 103 | #out_count = len(t5_tokenizer(out_text, add_special_tokens=False)["input_ids"]) 104 | latency_collector.record(latency_sec,rps=rps,in_tokens=in_count,out_tokens=in_count) 105 | 106 | latency_collector.report(model_id) 107 | -------------------------------------------------------------------------------- /study.md: -------------------------------------------------------------------------------- 1 | # Solving the Cold-Start Problem in Recommendations with vLLM on AWS Trainium 2 | 3 | **Introduction: The Cold‑Start Challenge** 4 | Cold start is a critical pain point in recommendation engines: without historical behavior, new users or items get lumped into “generic” buckets, hurting engagement and retention. Traditional approaches like collaborative filtering or matrix factorization simply don’t have enough signal to personalize effectively, and popularity‑based fallbacks often feel stale. What if you could synthesize richer user profiles on day one—without waiting weeks for interaction data? We propose using large language models (LLMs) to enrich sparse user interest profiles via zero‑shot reasoning, enabling meaningful recommendations without extensive per‑user training. 5 | 6 | In this blog, we show how to deploy LLMs on Amazon EC2 Trainium instances using AWS Deep Learning Containers (DLC), the Neuron SDK, and NeuronX Distributed (NxD) for scalable, cost‑efficient inference. You’ll learn how to generate interest expansions via structured prompts, encode them with high‑quality embeddings, retrieve candidates with FAISS, and apply simple validation to keep results grounded. This approach empowers ML engineers to frame the cold‑start challenge as a scientific experiment—quantitatively benchmarking LLM and encoder configurations on AWS Trainium, iterating rapidly on recommendation quality metrics, and demonstrating clear ROI for each model×encoder choice. 7 | 8 | #### Architecture Overview 9 | 10 | We build our cold‑start solution on Amazon EC2 Trainium (Trn1) instances, which are purpose‑built for high‑throughput generative AI and deliver up to 50 % lower inference costs compared to GPU‑based instances. To streamline model deployment, we leverage AWS Deep Learning Containers preconfigured with the Neuron SDK, which compiles PyTorch models into Neuron‑optimized code and provides the correct runtimes and drivers for Trainium out of the box. 11 | To scale across large models that exceed a single accelerator’s memory, we integrate NeuronX Distributed (NxD) with the vLLM inference library. NxD handles model sharding across all 32 NeuronCores in a Trn1 instance—or across multiple instances—with minimal code changes, enabling parallel inference of even 70 B‑parameter LLMs. This combination—Trainium hardware, Neuron SDK/DLC, and NxD—gives ML engineers a flexible, cost‑efficient, and production‑ready platform for experimenting with different LLM and encoder configurations, delivering rapid iteration on recommendation quality metrics without modifying core model code. 
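To make the moving parts concrete, here is a minimal sketch of how the repository drives vLLM on Neuron from one of its small YAML configs (mirroring `cell_compile_vllm.py` and `nxd_vllm_8b.yaml`; the prompt string below is only illustrative):

```python
import os
import yaml
from vllm import LLM, SamplingParams

# Tell vLLM to use the NeuronX Distributed inference backend
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

# nxd_vllm_8b.yaml sets model, tensor_parallel_size, max_model_len, device: "neuron", ...
with open("nxd_vllm_8b.yaml") as f:
    config = yaml.safe_load(f)

# NxD shards the model across NeuronCores according to tensor_parallel_size
llm = LLM(**config)

outputs = llm.generate(["Suggest related book topics for a reader of epic sci-fi."],
                       SamplingParams(top_k=1, temperature=1.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```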
12 | 
13 | We orchestrate our experiments and present results in a Jupyter notebook, allowing reproducible end‑to‑end workflows—from data loading and prompt engineering to embedding generation and FAISS‑based retrieval—complete with interactive charts. For production, we’ll point to a reference implementation showing how to package your Neuron‑optimized LLM and encoder images in DLC and deploy them on Amazon EKS with auto‑scaling, so your inference layer scales automatically to match demand while optimizing cost‑performance.
14 | 
15 | #### Expanding User Interest Profiles with LLMs
16 | 
17 | For our experiments, we leverage the Amazon Book Reviews dataset (mohamedbakhet/amazon-books-reviews) from Kaggle, which provides real‑world user reviews and metadata for tens of thousands of books. This rich corpus lets us simulate cold‑start scenarios—where a brand‑new user has only a single review or “like”—and evaluate how well our LLM expansions bootstrap personalization.
18 | At the core of our approach is using an LLM to enrich a new user’s profile from minimal initial data. For example, given that a user has only reviewed one sci‑fi novel, the LLM infers related sub‑topics—such as “galactic empires,” “cyberpunk dystopias,” or “space exploration”—that the user is likely to enjoy. To ensure consistency and relevance, we use structured prompts that embed the user’s existing activity into a concise instruction:
19 | 
20 | ```
21 | prompt = (
22 |     f"The user has shown interest in: {user_review_category}.\n"
23 |     "Suggest 3–5 related book topics they might enjoy.\n"
24 |     "Respond with a JSON list of topic keywords."
25 | )
26 | expanded_topics = llm.generate([prompt])[0].text
27 | ```
28 | 
29 | *Example prompt construction and LLM call. The `llm.generate` function represents an inference call to our deployed model.*
30 | 
31 | By constraining the LLM’s output format—asking it to return a JSON array of topic keywords—we avoid free‑form tangents and obtain a predictable list of interest expansions. Modern generative models possess broad domain knowledge and human‑like reasoning, enabling them to connect related concepts and serve as powerful cold‑start boosters by inferring deep user preferences from a single review. These synthetic interests become new signals for our recommendation pipeline, allowing us to retrieve and rank books from the Amazon Reviews corpus even with minimal user history. We experiment with LLM variants ranging from one‑billion to seventy‑billion parameters to identify which model yields the most discriminative and relevant expansions. Those findings will guide our choice of model for production and determine the size and scale of the EC2 Trainium and Inferentia instances we provision, setting us up for live user A/B tests to validate performance in real‑world settings.
32 | 
33 | #### Encoding User Interests and Retrieving Relevant Content
34 | 
35 | Once we have our expanded interests, the next step is to turn both those interests and our catalog of books into vectors that we can compare. We explore three sizes of the Google T5 encoder—base, large and XL—to see how embedding dimensionality affects matching quality.
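Before the expanded interests reach an encoder, the LLM’s raw reply needs the light parsing and validation pass mentioned in the introduction, since even a JSON-constrained prompt occasionally yields malformed output. The helper below is a minimal sketch under that assumption: `expanded_topics` is the raw text returned by `llm.generate` above, and the five-topic cap and comma-separated fallback are illustrative choices rather than part of the original pipeline.

```
import json

def parse_expanded_topics(raw_text, max_topics=5):
    """Parse the LLM's JSON reply into a clean, de-duplicated list of topic strings."""
    try:
        topics = json.loads(raw_text)
    except json.JSONDecodeError:
        # Fallback: treat the reply as a comma-separated line of keywords.
        topics = raw_text.split(",")
    cleaned = []
    for topic in topics:
        if not isinstance(topic, str):
            continue
        topic = topic.strip().strip('"')
        if topic and topic.lower() not in {t.lower() for t in cleaned}:
            cleaned.append(topic)
    return cleaned[:max_topics]

topics = parse_expanded_topics(expanded_topics)
# e.g. ["galactic empires", "cyberpunk dystopias", "space exploration"]
```

Keeping this step separate also gives a natural hook for grounding checks, such as dropping topics that never appear in the catalog vocabulary.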
The encoding and retrieval steps are as follows:
36 | 
37 | * Load the encoder for each size
38 | * Encode all book summaries into a single NumPy matrix and normalize it
39 | * Build a FAISS index on those normalized vectors for fast nearest‑neighbor search
40 | * Encode the expanded interest text the same way and query FAISS to retrieve the top k most similar books
41 | 
42 | ```
43 | from transformers import T5Tokenizer, T5EncoderModel
44 | import faiss
45 | import numpy as np
46 | 
47 | # Our dataset of book summaries
48 | content_texts = df["review/summary"].tolist()
49 | encoder_sizes = ["t5-base", "t5-large", "t5-3b"]  # t5-3b is the Hub checkpoint for the XL-scale T5
50 | top_k = 5
51 | 
52 | for size in encoder_sizes:
53 |     # 1. Load the tokenizer and encoder model for this size
54 |     tokenizer = T5Tokenizer.from_pretrained(size)
55 |     model = T5EncoderModel.from_pretrained(size)
56 | 
57 |     # 2. Encode all content into embeddings and normalize (batch this step for large catalogs)
58 |     inputs = tokenizer(content_texts, return_tensors="pt", truncation=True, padding=True)
59 |     outputs = model(**inputs)
60 |     content_embs = outputs.last_hidden_state.mean(dim=1).detach().cpu().numpy().astype("float32")  # simple mean pooling over all tokens
61 |     faiss.normalize_L2(content_embs)
62 | 
63 |     # 3. Build a FAISS index using inner-product (equivalent to cosine on unit vectors)
64 |     index = faiss.IndexFlatIP(content_embs.shape[1])
65 |     index.add(content_embs)
66 | 
67 |     # 4. Encode a single expanded interest and query the index
68 |     interest = "space opera with political intrigue"
69 |     enc = tokenizer([interest], return_tensors="pt", truncation=True, padding=True)
70 |     interest_emb = model(**enc).last_hidden_state.mean(dim=1).detach().cpu().numpy().astype("float32")
71 |     faiss.normalize_L2(interest_emb)
72 | 
73 |     distances, indices = index.search(interest_emb, top_k)
74 |     recommendations = [content_texts[i] for i in indices[0]]
75 | 
76 |     print(f"\nTop {top_k} recommendations using {size}:")
77 |     for title in recommendations:
78 |         print(" -", title)
79 | ```
80 | 
81 | This loop lets you compare how each encoder scale affects both the average FAISS distance (i.e., how “far apart” your interest is from the content) and the actual recommended titles. Swapping in a different encoder family—such as SentenceTransformers—is as simple as replacing the model and tokenizer imports.
82 | 
83 | #### Measuring and Improving Recommendation Quality
84 | 
85 | Now that we’ve generated FAISS indexes for every LLM‑encoder pairing and computed the mean distance between each “expanded interest” query and its top 10 neighbors, we know exactly how tightly or loosely each model’s embeddings cluster. The chart below shows those average distances for each combination—revealing that 1 B and 3 B models collapse to almost zero, while 8 B and 70 B models (especially with larger encoders) produce progressively higher distances, signifying richer, more discriminative signals for recommendation.
86 | 
87 | ![average-faiss-distance-per-llm-encoder-combo](average-faiss-distance-per-llm-encoder-combo.png)
88 | 
89 | The chart shows that the 1 B and 3 B models yield an average FAISS distance of zero, meaning their expanded‑interest embeddings are essentially identical and offer no differentiation. By contrast, the 8 B model produces a distance of about 0.5 with t5‑base, rising further with t5‑large and t5‑xl, which demonstrates that larger encoders capture more of the model’s nuance. The 70 B model only adds a small boost—and only with the XL encoder—so its extra cost yields limited benefit.
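For readers who want to reproduce the metric behind this chart, the sketch below shows one way to compute a per-combination average distance. It assumes `content_embs` and a matrix of expanded-interest embeddings (here `interest_embs`) have been produced and L2-normalized as in the loop above, and it uses an L2 index so the scores are true distances; on unit vectors, squared L2 distance equals 2 − 2 × cosine similarity, so the ranking is consistent with the inner-product index used for retrieval. It is an illustration of the measurement, not the exact benchmarking code behind the figure.

```
import faiss
import numpy as np

# Sketch: mean distance from each expanded-interest vector to its top-10 neighbors.
# `content_embs` and `interest_embs` are assumed to be L2-normalized float32 matrices.
k = 10
l2_index = faiss.IndexFlatL2(content_embs.shape[1])
l2_index.add(content_embs)

squared_dists, _ = l2_index.search(interest_embs, k)   # shape: (num_interests, k)
mean_distance = float(np.sqrt(squared_dists).mean())   # IndexFlatL2 returns squared L2
print(f"mean top-{k} FAISS distance for this (LLM, encoder) combo: {mean_distance:.3f}")
```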
90 | In practical terms, an 8 B LLM paired with a base or large T5 encoder delivers clear separation in embedding space without the higher inference time and resource usage of a 70 B model.
91 | 
92 | 
93 | #### Comparing Model and Encoder Impact on Embedding Spread
94 | 
95 | To see how LLM size and encoder scale shape our embedding space, we measured—for each `(LLM, encoder)` pair—the mean FAISS distance from a representative “expanded interest” vector to its top 10 neighbors. The bar chart below plots those averages side by side. You can instantly spot that the 1 B and 3 B models collapse to zero, the 8 B model jumps to around 0.5 and rises with larger encoders, and the 70 B model only adds a small extra spread at the XL scale. This helps you choose the smallest combination that still gives you the embedding diversity needed for effective cold‑start recommendations.
96 | 
97 | ![cold-start-expansion-llm-size-vs-encoder-size](./cold-start-expansion-llm-size-vs-encoder-size.png)
98 | 
99 | #### Evaluating Recommendation Overlap Across Models and Encoders to Balance Consistency and Novelty
100 | 
101 | In our next analysis, we build a simple `recommend_books` helper that, for any given LLM size and encoder choice, loads the corresponding expanded‑interest DataFrame, reads its FAISS index, reconstructs the first embedding as a stand‑in query, and returns the top k book titles. Using this helper, we first measure how much each pair of encoders agrees on recommendations for a single LLM (base vs. large, base vs. XL, and large vs. XL) and then, separately, how each pair of LLM sizes aligns for a fixed encoder. Finally, we focus on the 8 B model and plot a heatmap of its encoder overlaps, which shows that base and large share about 40 % of their top‑5 picks while XL diverges more, illustrating how changing the encoder shifts the balance between consistency and novelty in the recommendations.
102 | 
103 | ![encoder-overlap-for-llm](./encoder-overlap-for-llm.png)
104 | 
105 | For the 8 B model, the heatmap shows that t5_base and t5_large share 40 % of their top‑5 recommendations, t5_base and t5_xl also overlap 40 %, while t5_large and t5_xl overlap only 20 %, indicating that the XL encoder introduces the most novel titles of the three pairings.
106 | 
107 | ### Tweaking `tensor_parallel_size` for optimal cost-performance
108 | 
109 | To balance inference speed against resource cost, we measured how increasing Trainium tensor parallelism affects latency when expanding user interests with the Llama 3.2 3 B model on a trn1.32xlarge instance. We ran the same zero‑shot expansion workload at `tensor_parallel_size` values of 2, 8, 16, and 32. As shown in the first chart, P50 latency falls steeply—from about 2,480 ms at TP = 2 down to roughly 650 ms at TP = 16—then only inches lower to 532 ms at TP = 32. The second chart plots cost‑to‑performance ratios and makes it clear that beyond TP = 16, every doubling of parallelism roughly doubles cost for only a 17 % latency gain.
110 | 
111 | ![latency-vs-tensor-parallelism](./latency-vs-tensor-parallelism.png)
112 | 
113 | In practice, setting `tensor_parallel_size` to 16 delivers the best trade‑off: you capture most of the speed‑up from model sharding while avoiding the sharply diminishing returns and higher core‑hour costs that come with maximal parallelism.
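As a concrete illustration of the sweep described above, the sketch below times the same expansion prompt at several `tensor_parallel_size` settings and reports a P50 per setting. It assumes a Neuron-backed vLLM engine like the one sketched earlier; the model id, prompt, and iteration count are placeholders, and note that on Neuron each parallel degree generally needs its own compiled artifact, so rebuilding the engine per setting can trigger recompilation unless cached artifacts are available.

```
import time
import statistics
from vllm import LLM, SamplingParams

prompt = (
    "The user has shown interest in: science fiction.\n"
    "Suggest 3-5 related book topics they might enjoy.\n"
    "Respond with a JSON list of topic keywords."
)
sampling = SamplingParams(temperature=0.0, max_tokens=128)

for tp in (2, 8, 16, 32):
    # One engine per parallel degree; on Neuron this maps the model onto `tp` NeuronCores.
    llm = LLM(
        model="meta-llama/Llama-3.2-3B",  # assumption: the 3B checkpoint discussed above
        device="neuron",
        tensor_parallel_size=tp,
        max_model_len=2048,
        max_num_seqs=1,
    )
    latencies_ms = []
    for _ in range(20):  # small illustrative sample; the study used a fuller workload
        start = time.time()
        llm.generate([prompt], sampling)
        latencies_ms.append((time.time() - start) * 1000)
    print(f"TP={tp}: P50 latency of about {statistics.median(latencies_ms):.0f} ms")
    del llm  # release NeuronCores before building the next engine
```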
114 | 115 | ![cost-efficiency-diminishing-returns-beyond-tp-16](./cost-efficiency-diminishing-returns-beyond-tp-16.png) 116 | 117 | This figure visualizes the cost-to-performance ratio, emphasizing that TP=16 offers the most balanced efficiency before the benefits plateau. 118 | 119 | ### Conclusion 120 | 121 | This post showed how AWS Trainium, the Neuron SDK, and scalable LLM inference can tackle cold-start challenges by enriching sparse user profiles for better recommendations from day one. 122 | Importantly, our experiments highlight that *larger models and encoders don’t always mean better outcomes*. While they can produce richer signals, the gains often don’t justify the added cost. We found that an 8B LLM with a T5-large encoder strikes the best balance between performance and efficiency. 123 | Rather than assuming bigger is better, this approach helps teams identify the *optimal model-encoder pair*—delivering high-quality recommendations with cost-effective infrastructure. 124 | --------------------------------------------------------------------------------