├── .gitignore
├── README.md
├── images
│   ├── power_limit.jpeg
│   └── rig.jpeg
└── linux
    └── README.md

/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 | 
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 | 
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 | 
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 | 
54 | # Translations
55 | *.mo
56 | *.pot
57 | 
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 | 
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 | 
68 | # Scrapy stuff:
69 | .scrapy
70 | 
71 | # Sphinx documentation
72 | docs/_build/
73 | 
74 | # PyBuilder
75 | target/
76 | 
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 | 
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 | 
84 | # pyenv
85 | .python-version
86 | 
87 | # pipenv
88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
91 | # install all needed dependencies.
92 | #Pipfile.lock
93 | 
94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95 | __pypackages__/
96 | 
97 | # Celery stuff
98 | celerybeat-schedule
99 | celerybeat.pid
100 | 
101 | # SageMath parsed files
102 | *.sage.py
103 | 
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 | 
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 | 
117 | # Rope project settings
118 | .ropeproject
119 | 
120 | # mkdocs documentation
121 | /site
122 | 
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 | 
128 | # Pyre type checker
129 | .pyre/
130 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # navi at home
2 | 
3 | The name of the game is FLOPs, VRAM, mass storage, and
4 | high-bandwidth networking.
5 | 
6 | _can we have navi?_
7 | 
8 | _Mom: no, we have navi at home_
9 | 
10 | ![serialexperimentslain1](https://github.com/hitorilabs/navi/assets/131238467/fbb66c0e-b6b3-4eb8-b9c4-365544ef1b72)
11 | 
12 | # User Guide
13 | Github actually has a decent reading experience with markdown; the "happy path" for reading this like a blog is to click the `README.md` file to enter "code preview" and then hit the side menu button to pop out the table of contents.
14 | 
15 | screenshot of markdown reader
16 | 
17 | # Build Log
18 | 
19 | *These are not budget build guides - we haven't been grinding normie work just to park the money in a bank.*
20 | 
21 | #### 2024/02/11
22 | 
23 | Just bought 1 more 3090 and another 1600W PSU (probably on its last legs, but it was cheap, so I want to play around with it) - now I'm noticing data link errors. You'll see messages like this when you watch the kernel log with `sudo dmesg -w`:
24 | 
25 | ```
26 | [ 143.474053] pcieport 0000:00:03.1: [ 6] BadTLP
27 | [ 143.474055] pcieport 0000:00:03.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
28 | ```
29 | Related: [Article from PCI-SIG on Retimers](https://pcisig.com/pci-express®-retimers-vs-redrivers-eye-popping-difference)
30 | 
31 | I should try out some retimers, but I also don't know how much this impacts performance + system stability - for now, I can only monitor the situation until I test them out for myself. So far the impact on performance itself hasn't been very noticeable, but the actual impact might be more insidious.
32 | 
33 | I am also working on a more extensive strategy guide for buying, testing, and setting up this kind of hardware. There were just too many things that would've been hard to figure out if I wasn't willing to burn some money on niche experiments.
34 | 
35 | The good thing about buying a bunch of used crap is that you can mess up multiple times and still stay under budget. When you build your own workstation, you're basically just trying to "beat the market", which looks like:
36 | 
37 | - Threadripper (>2.5K CAD)
38 | - Compatible MOBO (>1.3K CAD)
39 | - 2x A6000s (2 * >9K = >18K CAD)
40 | 
41 | All-in, you are spending upwards of 25K CAD after taxes - plus, you still have to do enough research to not completely mess it up. Even if you find a deal, you are forced to make some very expensive bets on hardware that you may not be familiar with yet.
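Back to the AER spam: a rough way to see whether one slot or riser is the main offender is to tally corrected errors per PCIe device address. Here's a sketch that runs against a captured sample - on the live box you'd pipe `sudo dmesg` in instead:

```shell
# Captured sample of the AER messages shown above; on a live machine,
# replace `cat aer.log` with `sudo dmesg`.
cat > aer.log <<'EOF'
[  143.474053] pcieport 0000:00:03.1: [ 6] BadTLP
[  143.474055] pcieport 0000:00:03.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
EOF

# Tally errors per PCIe device (BDF address) to spot a flaky riser/slot.
grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' aer.log | sort | uniq -c | sort -rn
```

Recent kernels also expose per-device counters under `/sys/bus/pci/devices/*/aer_dev_correctable`, which saves you from grepping logs at all.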
42 | 
43 | Some links to check out:
44 | - [Puget Systems Workstation Build](https://www.pugetsystems.com/solutions/scientific-computing-workstations/machine-learning-ai/buy-200/)
45 | - [Tim Dettmers Guide](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#Is_it_better_to_wait_for_future_GPUs_for_an_upgrade_The_future_of_GPUs)
46 | 
47 | #### 2023/11/15
48 | 
49 | PSA if your boot is slow asf (ubuntu-server) for seemingly no reason: I wrote up a [gist](https://gist.github.com/hitorilabs/2ce6eabcf92ec7dd7acbcb72486aaf2e) about how extra unmapped ports can make your machine block boot for several minutes while it searches for them.
50 | 
51 | #### 2023/11/10
52 | 
53 | Attempted to go through the RMA process for some products just to see what it's like. The components are probably working fine, but I cited some obscure conditions (e.g. power limits, coil whine).
54 | 
55 | EVGA - mostly no questions asked, and tech support was actually quite helpful. If you live far away from the repair center (RIP canada bros), you will get absolutely destroyed by shipping costs and conversion fees because they don't cover anything for you (unless maybe you gaslight them into offering it)
56 | 
57 | ASUS - heard really bad things about them and I don't want to risk them sending back my cards scratched up - or worse, sending me used ones. I have an open RMA, but probably won't go through with it.
58 | 
59 | **PSA** about PSUs: check which cables actually come in the box for compatibility.
60 | 
61 | I have the EVGA 1600W P+ PSU, but when I RMA'd it they gave me a P2 because the P+ was out of stock at the time. The two PSUs are similar, but the P2 came with a completely different set of cables that weren't at all compatible.
62 | 
63 | My ROG STRIX 3090 model expects 3x 8-pin PCI-E single connections (`3 connectors * 3 GPUs = 9 Total`).
The support guy helped me figure out that my cards were blinking red because 3090s don't work when using both ends of split cables - you have to use them as single connections and leave one end dangling. This also implies that **running 3 GPUs is effectively the maximum** on a single PSU.
64 | 
65 | The EVGA 1600W P2 and P+ PSUs only have a total of 9 PCI-E power connector slots on the back:
66 | - The P2 comes with 4x 8-pin single connectors and 5x split 6 pin + (6 + 2) pin.
67 | - The P+ came with 4x 8-pin single connectors and 5x split 8 pin + (6 + 2) pin - so I could actually do the dangling cable thing and make it work.
68 | 
69 | If I had to choose a PSU again, I would've spent the extra cash on the T2 model because it has less coil whine and comes with the 9x single 8-pin PCI-E connectors that you'll want.
70 | 
71 | (see this [video](https://www.youtube.com/watch?v=V3OeKQU8AJE) on the T2, which also explains PSUs in general)
72 | 
73 | You also can't buy extra cables online - the single 8-pin PCI-E cables seem to always be out of stock.
74 | 
75 | #### 2023/09/30
76 | 
77 | **NAS Update**: Basically deleted my NAS machine, transplanted all the drives into my primary workstation, and set up a regular ZFS pool + NFS server on ubuntu (pleasantly surprised that all the data is retained and it just works). I was watching a bunch of yt videos and was tricked into thinking truenas was any good.
78 | 
79 | **Workstation Update**: Managed to get a 4090, so I swapped out the 3090 and cobbled together the remaining parts for my sister to do blender + game dev. Anything that fits on the single card can often feel nearly 2x faster - probably a combination of higher clocks + more cores + software gains on ada
80 | 
81 | Haven't played a game in a while, but of course I had to play cyberpunk on it - incredible. I was only vaguely keeping track of progress on the gaming/graphics side of things, but DLSS 3.5 w/ Path Tracing + Ray Reconstruction has to be pretty close to magic.
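For reference, the ZFS + NFS transplant above boils down to importing the pool (`sudo zpool import <pool>`), installing `nfs-kernel-server`, adding one line to `/etc/exports`, and running `sudo exportfs -ra`. A sketch of the exports line, where the pool name `tank` and the subnet are made-up examples:

```
# /etc/exports - pool name "tank" and subnet are hypothetical examples
/tank  192.168.1.0/24(rw,async,no_subtree_check)
```

ZFS can also export shares natively via the `sharenfs` dataset property (`zfs set sharenfs=on tank`), which saves editing `/etc/exports` by hand.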
82 | 
83 | #### 2023/08/30
84 | 
85 | recently secured a small pile of 3090s through the local social media marketplace, and it's surprisingly hard to put together a single machine capable of running them all.
86 | 
87 | rig
88 | 
89 | If only I had trusted the [lambdalabs article](https://lambdalabs.com/blog/deep-learning-hardware-deep-dive-rtx-30xx) to begin with, I wouldn't have spent countless hours trying to get 4 3090s running on a single 1600W PSU. Basically, the sweet spot is 3 3090s on a machine outfitted with server parts, since desktop parts will severely limit your total # of PCIE lanes (although you will rarely be bandwidth constrained on multiple PCIE 4.0 x8 slots).
90 | 
91 | If you are only planning to run a 2x GPU configuration as a workstation, you should consider settling for desktop parts and running them on PCIE 4.0 x8 (e.g. i9-13900K/i7-13700K both support 2x8 + 4 PCIE configurations, which leaves room for some weak networking or storage expansion). This way, you won't be dealing with used parts or old software.
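For rough context on what those lanes are worth: PCIe 4.0 moves roughly 2 GB/s per lane per direction, so even x8 leaves each card on the order of 16 GB/s. A quick back-of-envelope (the ~2 GB/s/lane figure is approximate, after encoding overhead):

```shell
# Approximate per-direction PCIe 4.0 throughput: ~2 GB/s per lane
# (16 GT/s with 128b/130b encoding works out to ~1.97 GB/s).
for lanes in 16 8 4; do
  echo "PCIe 4.0 x${lanes}: ~$((lanes * 2)) GB/s per direction"
done
```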
92 | 
93 | Type | Name | Quantity | Unit Cost
94 | -- | -- | -- | --
95 | PSU | EVGA 1600W P+ | 1 | 551.42
96 | GPU | ASUS ROG STRIX RTX 3090 | 3 | 850.00
97 | MOBO | ROMED8-2T/BCM | 1 | 882.52
98 | RAM | Micron 16GB DDR4-3200 RDIMM 1Rx4 (MTA18ASF2G72PZ-3G2R) | 4 | 60.915
99 | STORAGE | WD_BLACK 1TB SN850X NVMe | 1 | 89.00
100 | COOLER | Noctua NH-U9 TR4-SP3 | 1 | 101.64
101 | CASE | Mining Rig Frame for 12GPU, Steel Open Air Miner | 1 | 45.19
102 | RISERS | Thermaltake TT PCI-E 4.0 Riser Cable | 3 | 112.85
103 | CPU | AMD EPYC 7302P 16 cores 3.0GHz 155W | 1 | 195.91
104 | 
105 | **Total**: 4,997.89 CAD
106 | 
107 | EDIT: DO NOT BUY THE P+ MODEL FOR YOUR PSU, BUY A T2 (SEE 2023/11/10)
108 | 
109 | Review on pricing:
110 | - RAM - $60 for a single 16GB stick is robbery, but I couldn't find any 1Rx4 memory on ebay
111 | - RISERS - with one-day delivery on Amazon
112 | - MOBO - got lots of recommendations for this and saw a lot of vast.ai builds that use it
113 |   - bought from ebay at ~$300 off the Amazon price
114 |   - main advantage is that you get 7 slots with full PCIE 4.0 x16 bandwidth
115 | - STORAGE - pretty smol, but I won't be needing a ton on this machine yet because I'll be offloading to other machines
116 | - GPU - a steal considering that the sellers were a very well-off family w/ kids who were probably not thrashing these cards
117 | - CPU - a piece of crap, but I've never dealt with server parts before so I'm starting small.
118 |   - The ebay guys are surprisingly reliable, I wonder where they get these parts from 🤔
119 | - I also bought a torque driver with a 14 lb-in setting because I saw some comments about it on discord, but it might just be a psyop.
120 | 
121 | A neat chart on power limiting from a [Puget Systems article](https://www.pugetsystems.com/labs/hpc/NVIDIA-GPU-Power-Limit-vs-Performance-2296). This will help you control temperatures and stay comfortably under your power supply's maximum capacity.
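The knob for this is `nvidia-smi`'s power limit flag. A minimal sketch - the 280 W cap here is just an example, not a recommendation; pick your own point off the chart:

```shell
# Example: cap each 3090 at 280 W (stock limit is 350 W).
# Run these on the rig as root:
#   nvidia-smi -pm 1          # persistence mode, keeps driver state loaded
#   nvidia-smi -i 0 -pl 280   # -i selects the GPU, -pl sets the limit in watts

# Sanity-check worst-case draw against the PSU before settling on a number:
GPUS=3; LIMIT=280; PSU=1600
echo "worst-case GPU draw: $((GPUS * LIMIT)) W of ${PSU} W"
```

Note that the limit doesn't survive a reboot, so it usually ends up wrapped in a small systemd unit or startup script.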
122 | 
123 | power limiting chart
124 | 
125 | #### 2023/05/19
126 | 
127 | ubuntu-server pilled now.
128 | 
129 | Accidentally wiped my root partition and realized what I need is stability. I want my server to be up all the time and feel roughly the same as a vm on `lambdalabs`. I was also trying to put tailscale on my arch linux install, but it was missing some mystery dependency that I'll never figure out.
130 | 
131 | ubuntu-server had a pretty straightforward install and sensible defaults (sshd launches without login, netplan is simple, etc.)
132 | 
133 | - with tailscale + ubuntu, I can now connect to my DL workstation from anywhere and safely reboot it without losing ssh access (unless something goes terribly wrong)
134 | - UX workflow is now pretty much uniform with `lambdalabs` instances
135 | 
136 | #### 2023/05/11
137 | 
138 | - workstation w/ cuda
139 | - network attached storage + rsync
140 | - vs code w/ `copilot` + `remote - ssh plugin` + `jupyter notebooks` extension
141 | - `htop` + `nvtop` monitoring on my TV screen
142 | - arch btw
143 | 
144 | screencap gif of sensor readings + nvtop + htop
145 | 
146 | #### 2023/05/08
147 | 
148 | Built and set up the workstation over the weekend. For the record, I've never owned a GPU before, and my only interaction with linux distros had been through docker images on the cloud.
149 | - can I still say arch btw if I used `archinstall`...
150 | - moved all the overkill components from NAS to DL workstation
151 | - never touched networking before, so this was an
152 | enlightening experience
153 | 
154 | **Deep Learning Workstation (RETIRED)**
155 | 
156 | - RTX 3090
157 | - 1x 1TB NVMe (980 Pro) + 1x 2TB NVMe (970 Evo Plus)
158 | - Intel i7-13700KF (16-core)
159 | - arch btw
160 | 
161 | **Specifications**
162 | 
163 | Type | Name | Quantity | Unit Cost
164 | -- | -- | -- | --
165 | CPU | Intel Core i7-13700KF | 1 | 519.98
166 | COOL | Noctua NH-D15S | 1 | 99.95
167 | MOBO | Asus ROG STRIX Z690-E | 1 | 448.98
168 | RAM | Corsair Vengeance 64 GB (2 x 32 GB) DDR5-6000 | 1 | 319.99
169 | SSD | Samsung 980 Pro 1 TB M.2-2280 NVMe | 1 | 129.97
170 | SSD | Samsung 970 Evo Plus 2 TB M.2-2280 NVMe | 1 | 169.99
171 | GPU | NVIDIA RTX 3090 FE | 1 | 900.00
172 | CASE | Corsair 7000D AIRFLOW | 1 | 299.99
173 | PSU | Corsair RM850x 850W PSU | 1 | 184.99
174 | 
175 | **Total**: 3073.84 CAD
176 | 
177 | ---
178 | 
179 | This is clearly not a cheap build; some high-level considerations that were made:
180 | - `CASE` + `MOBO` have enough space + ports for 6 drives and lots of room for fans.
181 | - `MOBO` bundled with an M.2 expansion card for 4 extra slots, 2.5GbE LAN, wi-fi
182 | - `SSD` now that I know the `MOBO` came with an M.2 expansion card, I'm most likely going to invest in a lot more NVMe drives (and give the OS its own dedicated drives)
183 | - `RAM` leaving room to expand to 128 GB
184 | 
185 | #### 2023/05/01
186 | 
187 | I should just buy the machine already - I'm not completely broke. Looking through local marketplaces for honest people selling used 3090s. Doing my research on all things hardware.
188 | 
189 | #### 2023/04/20
190 | 
191 | I have more than enough storage to dump everything I see in the foreseeable future, but no compute to make any of this productive. There are lots of gains to be made with solid networking and storage, but cloud computing platforms often don't offer much control there.
I want to do a mix of inference and training, but none of the options make sense for me.
191 | - Lambda Labs never has any capacity available when I need it
192 |   - For an A100, it costs `1.10 * 24 * 365 = 9636` to run an inference server 24/7
193 |   - For an A10 it still costs `0.6 * 24 * 365 = 5256`, but it's not even as good as an RTX 3090
194 | - Google colab is [doing some whack stuff](https://twitter.com/thechrisperry/status/1649189902079381505)
195 |   - For a V100, expect `5.45 * 0.14 = 0.76` per hour, which is laughable compared to what you get with A10s.
196 |   - For an A100, expect `15 * 0.14 = 2.1` per hour, which is roughly double Lambda Labs pricing.
197 | 
198 | #### 2023/04/14
199 | 
200 | Finished building and setting up TrueNAS Scale over the weekend. Honestly, not super impressed with it.
201 | - RAID storage is a psyop
202 | - Could've just invested in a bunch of NVMe drives and dealt with extra storage when the time actually came.
203 | 
204 | #### 2023/04/10
205 | 
206 | Waiting 2 hours for a dataset to finish downloading over wifi, 5 mins for every model to download over the internet... this is the last time I will suffer.
207 | - Ordered a bunch of parts and HDDs from Amazon to build a NAS (one-day shipping and scheduled pickup for returns feel illegal)
208 | 
209 | # TODO
210 | 
211 | - increase utilization on these machines
--------------------------------------------------------------------------------
/images/power_limit.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hitorilabs/navi/f9c5fb224dd03867f38d7cd37a7c5dc839487d59/images/power_limit.jpeg
--------------------------------------------------------------------------------
/images/rig.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hitorilabs/navi/f9c5fb224dd03867f38d7cd37a7c5dc839487d59/images/rig.jpeg
--------------------------------------------------------------------------------
/linux/README.md:
--------------------------------------------------------------------------------
1 | 1. Put arch on a flash drive (on mac it's extremely simple with `dd`)
2 | 
3 | https://wiki.archlinux.org/title/USB_flash_installation_medium
4 | 
5 | ```
6 | dd if=path/to/archlinux-x86_64.iso of=/dev/diskX bs=1m
7 | ```
8 | 
9 | 2. boot into usb
10 | 3. run `archinstall` (probably use `grub`)
11 | 4. `xorg` profile just installs base linux with the packages necessary for gpu drivers
12 | 5. configure networking from here with a static IP
13 | 
14 | If you check `pacman -Qqe`, this is pretty much what you'll see:
15 | ```
16 | base
17 | base-devel
18 | efibootmgr
19 | grub
20 | intel-ucode
21 | linux
22 | linux-firmware
23 | nvidia
24 | xorg-server
25 | xorg-xinit
26 | ```
27 | 
28 | These are pretty great:
29 | ```
30 | tmux
31 | vim
32 | openssh
33 | git
34 | tailscale
35 | ```
36 | 
37 | Notes:
38 | - make sure to `systemctl enable` your services (e.g.
`sshd` and `tailscaled`, so that they start on boot - this way you can remotely reboot your server)
39 | 
40 | This is pretty much all you need for a capable home workstation that you can throw in your closet and access from anywhere.
--------------------------------------------------------------------------------