├── 2023-01-11.md
├── 2023-01-14.md
├── 2023-02-14.md
├── 2023-03-03.md
└── README.md

--------------------------------------------------------------------------------
/2023-01-11.md:
--------------------------------------------------------------------------------

# 2023-01-11

Logbook for the saga of trying to get a new version of the 6B released. Started on 2023-01-11.

## Context

Since this is the very first logbook, here's some historical context: as of now we've released 350M, 1.3B and 2.7B models, and our most recent one is a 6B.

The 6B Colab notebook was announced a couple(?) of days ago and there was apparently plenty of hype: it already became the 10th most downloaded conversational model of the month on HuggingFace, and it must have reminded people of our data collection efforts, since contributed data shot up from ~189MB to ~304MB in that timeframe.

We've also gotten plenty of feedback. The numbers show that the 6B is our most popular model by far (even summing up the downloads for all the other models, the 6B still has ~7x more), so for now I plan to iterate on the 6B and, once I have a good version of that, just attempt to replicate its configuration for the smaller models.

Notes about the new runs can be found below.

## Experiment 1

### What's new

On the data side:

- A bit of [AllenAI's SODA dataset](https://huggingface.co/datasets/allenai/soda) was added into our training data.
  - Specifically, rows where `relation` is `xAttr` were fed in, with `literal` being used as the persona data and `narrative` as the scenario.
  - This is an attempt to dilute some of the NSFW and possibly nonsensical dialogue present in the data contributed by the community, which as of now is the biggest part of our training data.
  - The SODA data was trimmed so as not to completely overpower our other data sources.
- Sentiment classification was run over the community contributed data (I'll call this CC data from now on to save on typing lol) for data balancing purposes.
  - We got feedback that characters would blush and get nervous too often. As it turns out, if we drop all episodes where characters blush, get nervous or stutter, almost 50% of the training data disappears (ouch).
  - To try and cut back on this overrepresentation, episodes with blushing/nervousness had a 70% chance of being dropped, and episodes with stuttering had a 50% chance of being dropped (see the sketch after this list).
- Basic deduping was done over the final training data.
  - A chunk of new CC data was actually old files, but with new dialogue appended at the end. This results in duplicate episodes, which are compounded by the data augmentation introduced in the last run.
  - To reduce possible impacts, episodes were deduped before being added to the training dataset.
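
As a rough illustration, the balancing and dedupe passes boil down to something like this (the episode representation and the flagging predicates are stand-ins for illustration - the real flagging logic isn't reproduced here):

```python
import hashlib
import random

def balance_and_dedupe(episodes, is_blush_nervous, is_stutter):
    """Probabilistically drop overrepresented mannerisms, then dedupe.

    `episodes` is a list of raw episode strings; the two predicates stand
    in for whatever heuristic/classifier output flags an episode.
    """
    seen = set()
    kept = []
    for episode in episodes:
        # Episodes with blushing/nervousness: 70% chance of being dropped.
        if is_blush_nervous(episode) and random.random() < 0.7:
            continue
        # Episodes with stuttering: 50% chance of being dropped.
        if is_stutter(episode) and random.random() < 0.5:
            continue
        # Basic dedupe: skip episodes whose exact text was already seen.
        digest = hashlib.sha256(episode.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(episode)
    return kept
```
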
The dataset consisted of 74,825,728 tokens.

On the training side, I talked with some other people who have fine-tuned similarly-sized models and made some adjustments so our parameters resembled theirs more closely:

- LR was significantly reduced: from 4e-5 to 5e-6
- Total batch size was reduced from 64 to 32

### Results

The run finished in about five and a half hours (9,134 steps).

Because I'm a troglodyte of a researcher, there are no eval/test splits and hence no metrics. I manually ran inference on the model to try and observe its behaviors instead.

Sampling settings used:

```
max_new_tokens=196, temperature=1, top_p=0.9, typical_p=1, repetition_penalty=1.05, top_k=0, penalty_alpha=0
```

Testing notes:

- Characters mess up their own names sometimes. For example, "Tsuki" will refer to themselves as "Tsukino" every now and again.
- Managed to run into a bug on the prototype UI: the last character of the user message sometimes gets appended to the beginning of the character's name in the response.
  - For example, if you say "How are you?", the response might be "? Bot: I'm doing well!"
- Persona is still kinda weak. Writing "I have very pale skin" and swapping it out with "I have tanned skin", for example, will result in the character responding correctly _most_ of the time when you ask them about it directly, but they'll still get it wrong a non-trivial number of times.
- Invalid markdown every now and again (an opening * with no closing one, for example - easily fixable on the UI side of things).

## Experiment 2

### What's new

- Another ~80MB of raw CC data has come in since the last experiment.
- Also, I got more feedback about character personas being weak and heavily influenced by the name of the character.
  - Non-CC data already has a nice amount of variety in character names, but CC data has a substantial number of dialogue examples coming from a much smaller set of characters, so that might bias things.
  - To work around that, I'm masking character names for CC data on this run (see the sketch after this list).
- Also, I noticed other redaction-related tokens that I wasn't handling properly in the CC data. Updated the code to handle the majority of them.
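
A minimal sketch of the name masking - the placeholder format and episode handling here are illustrative, not the exact tokens used by the real preprocessing code:

```python
import re

def mask_character_names(episode_text, character_names):
    """Replace character names with generic placeholder tokens.

    The point is that every occurrence of a name maps to the same generic
    token, so the model can't latch onto specific names as persona cues.
    """
    for i, name in enumerate(character_names):
        placeholder = f"[CHARACTER_{i}]"
        # Word boundaries so "Tsuki" doesn't also eat part of "Tsukino".
        episode_text = re.sub(rf"\b{re.escape(name)}\b", placeholder, episode_text)
    return episode_text

# mask_character_names("Tsuki: Hi! I'm Tsuki.", ["Tsuki"])
# -> "[CHARACTER_0]: Hi! I'm [CHARACTER_0]."
```
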
89,620,480 tokens over 10,940 steps.

### Results

Mostly the same issues as the last experiment, but character persona seems to have improved noticeably. Even W++ actually works better than I expected.

## Experiment 3

### What's new

- More CC data
- Batch size slashed in half again
- LR slightly down to 4e-6

98,935,808 tokens over 12,008 steps.

### Results

Haven't had the opportunity to test yet! Will push to HuggingFace and let the community test it out. Experiment 2 was pretty well-received and people are saying it's an improvement over the first release, so I'll probably release that as the new official version, which marks the end of this specific logbook.

--------------------------------------------------------------------------------
/2023-01-14.md:
--------------------------------------------------------------------------------

# 2023-01-14

Tales of trying out SFT.

## Experiment 4

### What's new

A little more CC data compared to the last run, plus I'm using convogpt's SFT loss implementation instead of a regular CLM loss as the training objective this time.

In more straightforward terms: the model was trained to predict responses to a given input, rather than trying to learn to predict the response _and_ the input prompt all mushed together.
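
In code, the difference comes down to which tokens contribute to the loss. Here's a minimal sketch of the idea (not convogpt's actual implementation), using the common HuggingFace convention that labels of -100 are ignored by the cross-entropy loss:

```python
def build_labels(prompt_ids, response_ids, sft=True):
    """Build the label sequence for one tokenized training example.

    With a regular CLM objective, every token is a prediction target.
    With the SFT-style objective, prompt tokens get the ignore index
    (-100), so only the response tokens contribute to the loss.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if sft:
        labels = [-100] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)
    return input_ids, labels
```
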
### Training

As an initial note, the SFT code I was using discards any entries that go over the model's max input length. As it does this, it prints out the entry that got discarded. Right off the bat, I noticed:

- A _lot_ of Russian. Did not expect that.
- Excessive repetitions ("!!!!??!!!??????......" and things of the sort)
- `END_OF_DIALOG`, `START:text_remaining:2` and other random tokens of the sort.

Might be worth cleaning these up from the dataset for future runs. As for the fine-tune itself, getting it to converge is an adventure in and of itself. Some chronological notes as I'm doing this:

- Need extremely high smoothing (0.9+) on Tensorboard to be able to spot any patterns in the metrics
- Original hyperparams used for the last few 6B runs don't seem to do well here
  - Tried bumping batch size up to 64, then down to 4, then up to 64 again with a different LR...
  - Final hyperparams: effective batch size of 64, LR 1e-6.
- _Lots_ of overflows. However, it seems to be converging? I'll leave it for a bit.
- Actually, for some reason the run keeps crashing around the same step:

  ```
  Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
  ```

  Gonna attempt to skip over it.
- ...happened again _way_ later on in the run after I had spread out checkpoint saving to save time, fml
- Run finished. Took much longer than the UFT runs (~4x longer). Loss curve is spiky as hell, but it does trend down if you squint hard enough.

### Results

Sampling settings used:

```bash
# no typical sampling/contrastive search
temperature=0.85, top_p=0.9, repetition_penalty=1.03, top_k=40
```

Testing notes:

- Model seems more attentive to user input; it seems much harder to get "ignored" with this version vs. the previous ones.
- At the same time, W++ doesn't seem to work as well compared to experiments 1-3. Possible sign of overfitting?
- It's really easy to get the model stuck giving out short responses, unless you take care to write out detailed messages for the greeting and example chat.

Either way, pushed this experiment to the `dev` branch on HuggingFace.

## Experiment 5

### What's new

- A little more CC data. Didn't pay attention to _exactly_ how much extra this time. Around 50MB, I think?
- Dropped all non-English data. Around ~2% of entries gone; doubt this is even worth the extra preprocessing time.
- Changed the SFT loss calculation to look _exclusively_ at the response instead of averaging response + prompt, just to see what that does.
- Experiment 4 overfitted. Having no eval metrics sure is great! Will try to take some inspiration from OpenAI's InstructGPT fine-tune hyperparameters to see how it goes, so:
  - Batch size of 32
  - LR of 9.65e-6, warming up for 500 steps then decaying over the rest of the epoch with a cosine scheduler.

### Training

- Got ~70% of the way through the first attempt, then the run crashed due to an NCCL timeout.
  - Decided to give the checkpoint a shot, but generation quality was _atrocious_.
  - Upon further inspection, I noticed that generations actually started off just fine, but went on for too long and started trailing off.
- After investigating the tokenization code, I noticed that EOS tokens were not being used - `\n` was being assumed to be the EOS token, which is not the case since our models are multiline. Oops (see the sketch below for what the fix looks like).
- Fixed the above and started another training run, this time with eval metrics thanks to TearGosling.
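
The fix amounts to explicitly terminating every training example with the tokenizer's real EOS token instead of relying on newlines. Roughly (function and variable names are made up; this isn't the actual diff):

```python
def tokenize_example(tokenizer, prompt, response, max_length=2048):
    """Tokenize one example, explicitly terminated with the EOS token.

    A newline can't act as a terminator for us since our models produce
    multiline responses, so the response has to end with the model's
    real EOS token for generation to know when to stop.
    """
    text = prompt + response + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=max_length)
```
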
### Results

Base settings:

```bash
# no typical sampling/contrastive search
temperature=0.6, top_p=0.9, repetition_penalty=1, top_k=0
```

Testing notes:

- Temperature drastically influences how well W++ works.
  - Turning it up enough from the default worsens W++ noticeably
  - Turning it _down_ from the default doesn't seem to worsen it that much, but does result in boring responses
  - ~0.6 might be a sweet spot? Responses are rarely blunt one-word replies, but the model still seems to respect the W++ persona well enough
- Repetition penalty influences W++ as well, which seems pretty obvious. Increasing it _at all_ makes W++ worse - and probably normal personas as well, depending on how they're worded.
- Putting negatives in the character persona doesn't seem to work that well, but I've noticed this in the previous experiments as well.
  - For example, if you add `Dislikes("Classical piano music")` to the persona, the character will say they don't like classical if you ask them about it directly ("what do you think about classical music?"), but if you instead give them a leading question or ask in a positive manner ("do you like classical music?"), they'll respond positively most of the time.
  - This seems to have improved somewhat in this experiment, though. The sampling settings mentioned above result in the character responding correctly most of the time.

TL;DR: More of the same. Pushing to HF for the sake of completeness, but I think any noticeable improvements to the model will likely need to come from the data and prompting instead of more fiddling with the training hyperparameters.

--------------------------------------------------------------------------------
/2023-02-14.md:
--------------------------------------------------------------------------------

# 2023-02-14

## Experiment 6

### What's new

Lots of data. Notably:

- **More community contributed chat logs:** We're up to ~6GB in raw submissions.
- **Instruction-following data:** Flan, chain of thought and other user-contributed instruction following datasets.
- **Other fiction/roleplay data:** These are still being processed and, since they're also much smaller in quantity, will be added later into the training data (possibly at a higher learning rate, too).
- **Knowledge grounding data:** To teach the model how to take any sort of external knowledge manually fed in at inference time (e.g. world info on Kobold, long-term memories fetched from a vector search database, internet search results) and convert it into a conversational response.

### Training

#### Overview

As with the previous supervised fine-tune runs, good hyperparameters were again tricky to find. I ended up doing a few sweeps over the weekend to find a decent configuration. In the end, I went with:

- Effective batch size of 256
- Learning rate of 0.98e-5
- Constant learning rate, with 24 warmup steps

Another problem is that we are no longer dealing with a training set in the range of tens/hundreds of MBs. Instead, the processed training set (before tokenization) now clocks in at around 14GB.

Not counting all the small annoyances that come with having to deal with that much data (transferring over the network, processing time, memory limits, etc.), we now have the problem of training times. A rough estimate would put us at around a month of training time with the current setup (4 RTX A6000s).

To make this sting a little less, I'll be splitting the training set into ten parts and gradually releasing "work in progress" checkpoints after training on each of them. This will also allow us to pivot on our data choices if, after testing the latest checkpoint, we notice any sort of problem with the generations.

That being the case, let's get started!

#### Part 1/10

**Training notes:**

- Loss curves look good. Evaluation loss is trending down consistently, from 1.438 to 1.203.
  - Towards the end (last ~60k training examples), evaluation loss seemed to stagnate and slightly bounce back up a few times.
  - This is about in line with numbers I've heard from other people (1GB of data for fine-tuning a 6B), but I'm curious to see what more data will do anyways.
- Done! 682,701 training examples seen after approximately 2 days and 21 hours of runtime.

**Testing notes:**

Sampling settings:

- **Temperature:** Usually 0.5 to 0.65, bumping up to 0.8-1.0 every now and again to see what happens.
- **Top P sampling:** Either 0.9 or disabled (1.0).
  - Disabling this _seems_ to give the model a little more freedom to be creative when coupled with a slightly higher temperature(?).
- **Repetition penalty:** 1.0 and 1.1.
  - Looks like the model is surprisingly good at not repeating itself even at 1.0, but every now and again in a more difficult situation it'll fall back to asking a question it _just_ asked you, or something of the sort. Bumping to 1.1 seems to fix that in most cases.

Results:

- From what I could tell, for assistant-like characters, this checkpoint performs _much_ better than v2 or v6. Obviously it's no ChatGPT or GPT-JT, though: asking for more detail or for different formatting, for example, doesn't work at all.
- Knowledge grounding data certainly had an effect: adding an extra turn before the response with `Relevant Knowledge: [something here]` results in the model rewording the knowledge into something more conversational (see the sketch after this list).
  - Unfortunately this seems to be really hit or miss for anything that's not a stated fact.
  - For example: if you ask about a band and give the model a Wikipedia excerpt, it'll pick out some phrases to reply to your question with.
  - However, if you give it some information about the character/yourself as if it's a memory, it's a coin flip as to whether that'll correctly guide the conversation, or whether the model will ignore it outright, or get confused and think it's conversation history.
- The curse of the short responses is still here. If you don't take the time to write out a long opening message and some detailed example conversations, the model will quickly trend towards short generations, like talking to someone over an instant messaging app.
  - The messages themselves are actually fine and even seem more engaging than what v6 spits out, but anyone expecting more novel/story-driven dialog or things of the sort will likely be disappointed.
  - To alleviate this, I'm considering dropping training examples where the responses are too short for part 2.
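
To illustrate the knowledge grounding trick, here's a hypothetical grounded prompt. The `Relevant Knowledge:` turn is the documented part; the character, the chat format details and the knowledge snippet are all made up:

```python
# Hypothetical prompt assembled by the inference frontend. The knowledge
# line would come from e.g. Kobold world info, a vector DB lookup or an
# internet search - the model's job is to reword it conversationally.
prompt = (
    "Mary's Persona: A friendly music nerd.\n"
    "<START>\n"
    "You: Do you know who wrote this song?\n"
    "Relevant Knowledge: The song was written by John Doe in 1973.\n"
    "Mary:"
)
# A good completion rewords the fact instead of parroting it, e.g.:
#   "Oh, that one's a John Doe tune! He wrote it back in 1973."
```
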
I'll release this as an experimental version over at HuggingFace, and get to work on part 2 of training.

**Update:** As of now, the v7 run has been aborted. Check the [next logbook entry](./2023-03-03.md) for more details.

--------------------------------------------------------------------------------
/2023-03-03.md:
--------------------------------------------------------------------------------

# 2023-03-03

## Experiment 7

Bit of a hiatus since the last entry because our main training machine was down for over a week.

Either way, feedback for v7 (experiment 6, for those who are understandably confused) was overwhelmingly consistent: the model degenerates into overly-brief responses way too easily. The ~gigabyte of data that the first v7 checkpoint saw was representative of the entire dataset, so I don't think training for around a month on the remaining data makes any sense given the problems we've already spotted.

That being the case, I have opted to abandon v7 for now and instead try again with a new version of the dataset.

### What's new

New data and stricter filtering. Notably:

- **Short responses are aggressively trimmed out of the dataset.**
  - Any training examples where the median response length falls under a certain threshold are dropped completely, and as median response length increases, the chances of dropping the training example diminish (see the sketch after this list). This will hopefully skew the model towards longer responses by default.
- The above dropped the size of our dataset by a few gigabytes, so to compensate we:
  - Added some more instruction following data.
  - Incorporated all freshly submitted community contributed chat logs.
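
The filter's shape is roughly the sketch below - the thresholds and the linear fall-off are illustrative placeholders, not the actual values used:

```python
import random
import statistics

def keep_example(responses, min_median_words=12, safe_median_words=40):
    """Length-based filtering of short-response training examples.

    Below `min_median_words` the example is always dropped; above
    `safe_median_words` it's always kept; in between, the drop chance
    falls off linearly. All three knobs here are made-up values.
    """
    median = statistics.median(len(r.split()) for r in responses)
    if median < min_median_words:
        return False
    if median >= safe_median_words:
        return True
    drop_probability = (safe_median_words - median) / (safe_median_words - min_median_words)
    return random.random() >= drop_probability
```
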
### Training

#### Overview

Nothing new this time; kept everything the same as the previous experiment. So:

- Setup:
  - 4 x 48GB RTX A6000s
  - DeepSpeed with ZeRO stage 1, bf16
- Hyperparameters:
  - Effective batch size of 256
  - Learning rate of 0.98e-5
  - Constant learning rate, with 24 warmup steps

#### Part 1/10

**Training notes:**

So slow! The training loop ran for 3d 8h 58m 42s. Definitely think this could be optimized. Either way, loss curves look good.

**Testing notes:**

Sampling settings:

- **Temperature:** Usually 0.5 to 0.7, bumping up to 0.8-1.0 every now and again to see what happens.
- **Top P sampling:** Either 0.9 or disabled (1.0).
- **Top K sampling:** Either 0 or 40.
- **Repetition penalty:** Between disabled (1.0) and 1.2.
  - Sometimes keeping the repetition penalty disabled works great. Eventually the model will catch on to some mannerism and start repeating it though, so defaulting to something higher than 1.0 seems ideal.

Results:

- Ultra-short responses like "...", "Yes.", "Hm?" seem a lot rarer now!
- If using an assistant-like character and trying to use the model like it's instruction-tuned:
  - Zero-shot doesn't work _that_ well (needs quite a few regens for something good to come up)
  - Few-shot works surprisingly well
  - Anything that requires a long, complete response (e.g. a cooking recipe or a functional piece of code) is prone to hallucinations
  - Would like to try the above with external knowledge grounding to see whether hallucinations become rarer
- Didn't play with the model for very long. Will release on HF and see what the community has to say.

#### Part 2/10

**Training notes:**

Feedback from the community is that while short responses are not as common as in v7, they still happen. We already have enough checkpoints for people who need short responses (e.g. for Discord bots or something), so I decided to try being really aggressive about this and just pruned off any responses with fewer than 60 words (over 50% of the dataset) for this part, to try and balance out all the shorter data the model saw in part one.

**Testing notes:**

Same sampling settings as for part one. First impressions:

- Longer responses seem a _little_ more common. The small effect is understandable, given that the model has already seen many more short messages in the previous part than longer messages in this one.
- A big problem with longer responses, though, is that there's more room in them for the characters to say something wrong/inconsistent/out of place/etc.
  - Contrastive search noticeably reduces the chances of this happening, but it's not available everywhere.
- I saw one report of messages getting cut off on v8 part one, and indeed I think I saw this happen here: every now and again there will be an unbalanced `*` or `"` pair in the generated response.
  - Will write something to look for these in the training data and prune any incorrectly formatted training examples from the dataset, if they exist (a first stab at that check is sketched below).
- Will release on HF and wait for some community feedback while I adjust the training data for part three.
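
Something like this should catch the obvious cases - it only checks that paired delimiters close properly within each message, so treat it as a starting point rather than the final filter:

```python
def has_unbalanced_delimiters(message):
    """True if *...* or "..." pairs don't close properly in a message.

    Asterisks and double quotes should always come in pairs in our data,
    so an odd count means a broken or truncated pair somewhere.
    """
    return message.count("*") % 2 != 0 or message.count('"') % 2 != 0

# Usage: drop any training example containing a flagged message.
# examples = [ex for ex in examples
#             if not any(has_unbalanced_delimiters(m) for m in ex["messages"])]
```
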
#### Part 3/10

**Training notes:**

Feedback for part 2/10 was generally positive, with the cons basically matching up with what I found while testing. I'm hoping that the rambling will be improved by the model seeing more long-form data, so I trimmed the training data for part 3 exactly the same way as I did for part 2.

**Testing notes:**

I'm a little short on time, so I loaded up the model just to make sure nothing was horribly broken. Seems OK; pushing to HF to see what everyone thinks.

#### Part 4/10

**Training notes:**

Interestingly enough, feedback for part 3 was (mostly) negative despite no changes in hyperparameters or training data distribution. For now, I'm going to chalk this up to random chance - maybe the data in part 3 was somehow not representative of the full set, and leaned more towards one data source or something of the sort.

Either way, from my very brief testing of part 3 it looked like responses were plenty long by default, so I've dialed the filtering back a little: instead of dropping responses that were under 60 words, I've only dropped the ones below 50 words for this part.

**Testing notes:**

Metrics indicate we've hit diminishing returns - evaluation loss barely moved over 209k optimization steps. Testing the model itself, though, the responses are still decently sized and descriptive, and the tendency for it to ramble and bring up out-of-place stuff has lessened quite noticeably, even without contrastive search.

As with the feedback for part 3, I'm unsure how much of this is down to dumb luck vs. the extra training data having an effect.

Uploaded to HF. If community feedback is negative/neutral for this part:

- I'll assume that there won't be significant benefits to training v8 all the way to completion, and will shift priorities around.
  - I might keep training it, but only between other experiments (so the GPUs aren't sitting there just idling).
- In the meantime, I'd like to experiment with:
  - Drastically different ways of formatting our prompts during training time, taking inspiration from Chain of Hindsight fine-tuning ([paper](https://arxiv.org/abs/2302.02676), [code](https://github.com/lhao499/CoH), [summary in the form of a Twitter thread](https://nitter.net/haoliuhl/status/1630696378325413888))
  - Model/pipeline parallelism + PEFT + 8-bit training to see if we can scale past 6B.
    - Lots of new inference optimizations have been showing up recently, so people are now able to run bigger models with the same hardware, and we'll likely be able to scale past 6B on Colab too.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# logbooks

This is where we keep notes about our training runs. Each file is named based on the date it was started.

--------------------------------------------------------------------------------