├── Assets
│   ├── Database-workflow.png
│   ├── Database.png
│   ├── WoC_Logo.png
│   └── reuse_DFD.png
├── Database.dia
├── LICENSE
├── README.md
├── ShellGuide.md
└── wochardware.md

/Assets/Database-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/Database-workflow.png
--------------------------------------------------------------------------------
/Assets/Database.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/Database.png
--------------------------------------------------------------------------------
/Assets/WoC_Logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/WoC_Logo.png
--------------------------------------------------------------------------------
/Assets/reuse_DFD.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/reuse_DFD.png
--------------------------------------------------------------------------------
/Database.dia:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Database.dia
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
World of Code Infrastructure includes
=====================================
a) Copyright (c) (2018, 2019, 2020, 2021, 2022, 2023, 2024) Audris Mockus
World of Code software used to discover, retrieve, clean, cross-reference,
query, and analyse open source version control data is licensed under Mulan PSL v2.
You can use this software according to the terms and conditions of the Mulan PSL v2.
You may obtain a copy of Mulan PSL v2 at: http://license.coscl.org.cn/MulanPSL2
THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
See the Mulan PSL v2 for more details.

=====================================

b) The metadata, including the collections and relationships among the source code,
is licensed under the Creative Commons Attribution 4.0 International license. Please see full details
at: https://creativecommons.org/licenses/by/4.0/

Furthermore, by accessing the data in the WoC infrastructure, you agree with the Ethical Charter for using the
archive data (see, e.g., https://www.softwareheritage.org/legal/users-ethical-charter/).

=====================================

c) The actual source code in the collection retains the original license
of the specific source code/version.

=====================================

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Tutorial: World of Code (WoC) Basics [DEPRECATED]

![WoC_logo](/Assets/WoC_Logo.png)

# Please see the new tutorial docs [here..](https://worldofcode.org/docs/#/)

## Overview

World of Code (WoC) is a large-scale infrastructure designed to mine and analyze the entirety of open-source software ecosystems. It aggregates data from millions of repositories across various platforms (e.g., GitHub, GitLab) and provides cross-references between authors, projects, commits, and files. This enables researchers and developers to study software development trends, dependencies, and code sharing at a global level. WoC is essential for researchers looking to examine the evolution and structure of open-source ecosystems, supporting analyses in software supply chains, developer behavior, and code reuse.

## Important Links

1. [New Tutorial Documentation](https://worldofcode.org/docs/#/)

2. [WoC Registration Form](https://docs.google.com/forms/d/e/1FAIpQLSd4vA5Exr-pgySRHX_NWqLz9VTV2DB6XMlR-gue_CQm51qLOQ/viewform?vc=0&c=0&w=1&flr=0&usp=mail_form_link): To request access to our servers

3. [WoC Structure and Its Elements Video](https://youtu.be/c0uFPwT5SZI)

4. [Tutorial Recording from 2022-10-27](https://drive.google.com/file/d/1ytzOiOSgMpqOUm2XQJhhOUAxu0AAF_OH/view?usp=sharing) and [Older Tutorial Recording (possibly obsolete) from 2019-10-15](https://drive.google.com/file/d/14tAx2GQamR4GIxOc3EzUXl7eyPKRx2oU/view?usp=sharing)

5. [WoC Website](https://worldofcode.org)

6. [WoC Discord](https://discord.gg/fKPFxzWqZX): Get updates or ask questions related to WoC

## Additional Resources

1. [WoC Shell Guide](https://github.com/woc-hack/tutorial/blob/master/ShellGuide.md): A brief guide on how to use bash and other related tools

2. [Unix Tools: Data, Software and Production Engineering](https://courses.edx.org/courses/course-v1:DelftX+UnixTx+1T2020/course/): Consider auditing this Massive Open Online Course (MOOC) if you are not comfortable working in the terminal or with shell scripting

## Before You Start..

### Step 1. Requirements to Access the da Server(s)

To register for the hackathon/tutorial, please generate an SSH public key. See the instructions below.

For macOS and Unix users, the instructions below will work as-is. For Windows users, the best option is to enable the [Ubuntu Shell](https://winaero.com/blog/how-to-enable-ubuntu-bash-in-windows-10) or [install Linux on Windows with WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10) first, then follow the instructions for Unix/macOS. Alternatively, you may use the [OpenSSH Module for PowerShell](https://www.techrepublic.com/blog/10-things/how-to-generate-ssh-keys-in-openssh-for-windows-10/) or [Git-Bash](https://docs.joyent.com/public-cloud/getting-started/ssh-keys/generating-an-ssh-key-manually/manually-generating-your-ssh-key-in-windows#Git-Bash).

To generate an SSH key, open a terminal window and run the `ssh-keygen` command. Once completed, it produces the `id_rsa.pub` and `id_rsa` files inside your `$HOME/.ssh/` folder.
To view the newly generated *public key*, type:

```
cat ~/.ssh/id_rsa.pub
```

You will need to provide this SSH *public key* when you complete the **WoC Registration Form** (step 3), as the form will ask you for the contents of `id_rsa.pub` and for your **GitHub** and **Bitbucket** handles (step 2). You will receive a response to the email you provide in the form once your account is set up (more details below).

Set up your `~/.ssh/config` file so that you can log in to one of the da servers without having to fully specify the server name each time:

```
Host *
  ForwardAgent yes

Host da0
  Hostname da0.eecs.utk.edu
  Port 443
  User YourUsername
  IdentityFile ~/.ssh/name_of_priv_key
```
Please note that access to the remaining servers is set up similarly. da2 and da3 use SSH port 22 (both run the worldofcode.org web server on the HTTPS port 443).

*YourUsername* is the login name you provided on the signup form. With the config set up, logging in becomes as simple as typing `ssh da0` in your terminal.

### Step 2. GitHub and Bitbucket Accounts Setup

If you don't have these already, please set up an account on both GitHub and Bitbucket (these will be needed to invite you to the relevant repositories on GitHub & Bitbucket).
* [GitHub Sign-up](https://github.com/pricing)
* [Bitbucket Sign-up](https://bitbucket.org/account/signup/)

### Step 3. Request for Access

Users may access our systems/servers by obtaining a WoC account. You may do so by registering for an account through the [WoC Registration Form](https://docs.google.com/forms/d/e/1FAIpQLSd4vA5Exr-pgySRHX_NWqLz9VTV2DB6XMlR-gue_CQm51qLOQ/viewform?vc=0&c=0&w=1&flr=0&usp=mail_form_link). We strive to give new users access the same day they fill out the form, but in the worst case, please allow up to one day for account creation.

## Tutorial Objectives

Prepare for the hackathon or perform research: make sure connections work, get familiar with the basic functionality and potential of WoC, and start thinking about how to investigate global relationships in open source.

### WoC Objectives

Do the hard work to enable research on global properties of Free, Libre, and Open Source Software (FLOSS):

* Census of all FLOSS
  - What is out there, of what kind, and how much
  - Ability to select projects/developers/APIs for natural experiments and other empirical studies
* Provide FLOSS-wide relationships
  - Technical dependencies (to run applications)
  - Tool dependencies (to build/test applications)
  - Code copying
  - Knowledge (and people) migration
  - API use and spread over time
* Data Cleaned/Augmented/Contextualized
  - Correction: Authors/Forks/Outliers
  - Augmentation: Dependencies/Linking to other data sources
  - Context: project types/expertise
* Big Data Analytics: Map entities to all related entities efficiently
* Timely: Targeting an analyzable snapshot of the entire FLOSS that is less than one quarter old
* Community run
  - Hackathons help determine the community needs
  - [Hackathon Schedule](https://github.com/woc-hack/schedule)
* How to participate?
  - [Hackathon Registration Form](http://bit.ly/WoCSignup)
  - If you cannot attend the hackathon but just want to try out WoC, please fill out the hackathon form, but indicate in the topic section that you do not plan to attend the hackathon.

### What WoC Contains

![Workflow](https://github.com/woc-hack/tutorial/blob/master/Assets/Database-workflow.png)
![Content: Commits, trees, blobs, projects, authors](https://github.com/woc-hack/tutorial/blob/master/Assets/Database.png)

### Related background reading

- [About WoC](https://bitbucket.org/swsc/overview/raw/master/pubs/WoC.pdf)
- [Overview of the Software Supply Chains](https://bitbucket.org/swsc/overview/src/master/README.md)
- [Details on WoC storage/APIs](https://bitbucket.org/swsc/lookup/src/master/README.md)
- [Fun Facts](https://bitbucket.org/swsc/overview/src/master/fun/README.md)

## Activity 1: Access to da Server(s)

Log in: `ssh da0`.

Once you are on a da server, you will have an empty directory under `/home/username` where you can store your programs and files:
```
-bash-4.2$ pwd
/home/username
-bash-4.2$
```

Set up your shell:
```
-bash-4.2$ echo 'exec bash' >> .profile
-bash-4.2$ echo 'export PS1="\u@\h:\w>"' >> ~/.bashrc
-bash-4.2$ . .bashrc
[username@da0]~%
```

You can also log in to the other da servers, but you first need to set up an SSH key on these systems:
```
[username@da0]~% ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/niravajmeri/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/niravajmeri/.ssh/id_rsa.
Your public key has been saved in /home/niravajmeri/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:/UoJkpnx5mn8jx4BhcnQRUFfPq4qmC1MVLRJSjpYnpo niravajmeri@da0.eecs.utk.edu
The key's randomart image is:
+---[RSA 2048]----+
|      . o=o**. . |
|     + + o*+ . o |
|    . = o.+ . o  |
|   o ..* o . .   |
|  E .= S o .     |
|   .= o + .      |
|  o += + o       |
|   =.oo =        |
|  . o*..         |
+----[SHA256]-----+
```
Once the key is generated, add it to your `.ssh/authorized_keys`:
```
[username@da0]~% cat .ssh/id_rsa.pub >> .ssh/authorized_keys
```

Now you can log in to da4:
```
[username@da0]~% ssh da4
[username@da4]~%
```

### Exercise 1

Log in to da0 and clone the two repositories that contain APIs to access WoC data:
```
[username@da0]~% git clone https://bitbucket.org/swsc/lookup
[username@da0]~% git clone https://github.com/ssc-oscar/oscar.py
```

Log in to da4 from da0:
```
[username@da0]~% ssh da4
[username@da4]~% ls
...
[username@da4]~% exit
[username@da0]~%
```

### Important Note

Make sure to access these directories and execute a `git pull` frequently to ensure you are working with the latest updates.

## Activity 2: Shell APIs - Basic Operations

Shell APIs are useful for accessing the content of commits, trees, and blobs, for calculating the diff produced by a commit, and so on.

For more examples, [see the full API](https://bitbucket.org/swsc/lookup/src/master/README.md).

Let's look at commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a:

```
[username@da0]~% echo 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a | ~/lookup/showCnt commit 3
tree 464ac950171f673d1e45e2134ac9a52eca422132
parent dddff9a89ddd7098a1625cafd3c9d1aa87474cc7
author Warner Losh <imp@bsdimp.com> 1092638038 +0000
committer Warner Losh <imp@bsdimp.com> 1092638038 +0000

Don't need to declare cbb module. don't know why I never saw
duplicate messages..
```

This commit has a tree and a parent commit and was created by 'Warner Losh <imp@bsdimp.com>'.
(The parameter 3 specifies that raw output should be produced.)

Let's inspect the tree (the root folder of the project), looking at its first and last entries:
```
[username@da0]~% echo 464ac950171f673d1e45e2134ac9a52eca422132 | ~/lookup/showCnt tree | awk 'NR==1; END{print}'
100644;a8fe822f075fa3d159a203adfa40c3f59d6dd999;COPYRIGHT
040000;6618176f9f37fa3e62f2efd953c07096f8ecf6db;usr.sbin
```

We may also want to inspect the first element in the tree (the blob representing the file COPYRIGHT). We limit the output to the first two lines only:
```
[username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/showCnt blob | head -n 2
# $FreeBSD$
# @(#)COPYRIGHT 8.2 (Berkeley) 3/21/94
```

### Important Note

When you want to get the content of many objects (or look up values for many keys), please use a single function invocation and provide multiple keys/SHA1s on standard input, since each call to showCnt and getValues may involve an ssh to another da server (where the data resides).
To separate the content of different blobs, you can ask showCnt to put the output for each blob on a single line, for example:
```
[username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/showCnt blob 1
```
This command produces a single line of output, starting with the SHA1:
```
a8fe822f075fa3d159a203adfa40c3f59d6dd999;IyAkRnJlZUJTRCQKIwlAKCMpQ09Q....
```
The content of the blob is base64-encoded (use Python's base64.b64decode to recover it).
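To illustrate, here is a minimal Python sketch for decoding such single-line output. It assumes only the `sha1;base64data` format shown above; reading from standard input is an arbitrary choice made for the example:

```
import base64
import sys

# Each input line has the form: sha1;base64-encoded-content
for line in sys.stdin:
    sha1, b64 = line.rstrip("\n").split(";", 1)
    content = base64.b64decode(b64)  # raw bytes of the blob
    print(sha1, content.decode("utf-8", errors="replace")[:80])
```

For example, piping the `showCnt blob 1` output above into this script would print the first 80 characters of the COPYRIGHT blob next to its SHA1.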
### Exercise 2

Determine the author of the parent commit for commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a.

Hint 1: The parent commit is listed in the content of commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a above.
```
[username@da0]~% echo dddff9a89ddd7098a1625cafd3c9d1aa87474cc7 | ~/lookup/showCnt commit
```

### Summary for Activity 2

Synopsis:
```
~/lookup/showCnt commit|tree|blob
```
reads the SHA1s of the corresponding objects from standard input and prints the content of these objects.

## Activity 3: Investigate the Maps

We saw the content of the copyright file above. Such files are often copied verbatim. Let's determine the first author who created it (irrespective of the repository).
WoC has computed this relationship and stored it in the b2fa (blob to first author) map:

```
[username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/getValues b2fa
a8fe822f075fa3d159a203adfa40c3f59d6dd999;1072910122;Warner Losh <imp@bsdimp.com>;121f970412fec7f9af0352a9b4ce8dca43bdb59e
```

It turns out that it was created by commit 121f970412fec7f9af0352a9b4ce8dca43bdb59e, done by what appears to be the same author, at Unix second 1072910122.

What is b2fa? The letters signify what the keys (b = blob) and values (fa = first author) mean. As in a natural sentence, some disambiguation from context is needed in rare cases such as this one, because f generally stands for file; read literally, b2fa would mean blob to file and author. As the number of objects and maps multiplies, single letters will no longer do, and full-word parsing will be used.

**Primary Objects:**

* a = Author (A - aliased author)
* b = Blob (the b2c map will become obsolete as of version U, since one can get more info from b2tac)
* c = Commit, cc - child commit, and pc - parent commit
* f = File (occasionally it's an adjective modifying the following object, as in fa, or First Author)
* p = Project (P - deforked project)
* t = Time (unsigned long Unix time in UTC)
* g = gender

**Capital Version** - simply means that the data has been corrected:
* A = aliased version (see https://arxiv.org/abs/2003.08349); any organizational and group IDs, bot IDs, as well as author IDs that do not contain sufficient info to alias, are excluded.
* P = deforked project (via Louvain community detection on the commit/repo bi-graph: https://arxiv.org/abs/2002.02707)

We can inspect the relationships between a and A, and also between p and P:
```
[username@da0]~% echo 'Warner Losh <imp@bsdimp.com>' | ~/lookup/getValues a2A
Warner Losh <imp@bsdimp.com>;imp <imp@bsdimp.com>
[username@da0]~%
[username@da0]~% echo 'imp <imp@bsdimp.com>' | ~/lookup/getValues A2a | tr ";" "\n" | head -n 3
imp <imp@bsdimp.com>
M. Warner Losh <imp@bsdimp.com>
M. Warner Losh <imp@village.org>
```

Going back to the blob, we may ask if it has been widely copied, as would be expected for copyright files. We can use b2tac to obtain, for a blob's SHA1, the time, author, and commit of every commit creating it. The following example pipes the output to show only the first entry:

```
[username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/getValues b2tac | cut -d ";" -f1-4
a8fe822f075fa3d159a203adfa40c3f59d6dd999;1072910122;Warner Losh <imp@bsdimp.com>;121f970412fec7f9af0352a9b4ce8dca43bdb59e
```

b2tac (blob to time, author, commit) shows the numerous commits that introduced that blob in all repositories. We can further use the commit-to-project map (c2p) to identify all associated projects (the commit SHA1s are in fields 4, 7, 10, ... of the b2tac output):

```
[username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/getValues b2tac | awk -F\; '{for (i = 4; i <= NF; i += 3) print $i}' | ~/lookup/getValues c2p
...
```

The a2c map, in turn, provides the commits made by a given author ID. We already know the author 'Warner Losh <imp@bsdimp.com>' for the commit we have investigated.
Can we find what other commits Warner has made?
(The following output is limited to three commits only):
```
[username@da0]~% echo 'Warner Losh <imp@bsdimp.com>' | ~/lookup/getValues a2c | tr ";" "\n" | head -n 4
Warner Losh <imp@bsdimp.com>
0000ce4417bd8d9a2d66a7a61393558d503f2805
000109ae96e7132d90440c8fa12cb7df95a806c6
0001246ed9e02765dfc9044a1804c3c614d25dde
```

In addition to variable-length records (key;val1;val2;...;valn), the output can be produced as a flat table (key;val1\nkey;val2\n...\nkey;valn) using the -f option:
```
[username@da0]~% echo 'Warner Losh <imp@bsdimp.com>' | ~/lookup/getValues -f a2c | head -n 5
Warner Losh <imp@bsdimp.com>;0000ce4417bd8d9a2d66a7a61393558d503f2805
Warner Losh <imp@bsdimp.com>;000109ae96e7132d90440c8fa12cb7df95a806c6
Warner Losh <imp@bsdimp.com>;0001246ed9e02765dfc9044a1804c3c614d25dde
Warner Losh <imp@bsdimp.com>;00014b72bf10ad43ca437daf388d33c4fea73df9
Warner Losh <imp@bsdimp.com>;000153916157b29a14b65fa3efeff4e3788e1b0e
```

In addition to random lookup, the maps are also stored in flat sorted files, and this format is preferred (faster) when investigating over two hundred thousand items or the entire WoC. For example, to find commits by any author named Warner (a similar task would be to find all blobs or commits involving a C-language file ".c" or a README file "README"):
```
[username@da0]~% zcat /da7_data/basemaps/gz/a2cFull.V3.0.s | grep 'Warner Losh'
```
As described below, the maps are split into 32 (or 128) parts to enable parallel search.
Full.V3.0 means that we are looking at a complete extract at version V3.0.

As versions keep being updated and the data no longer fits on a single server, a more flexible way to run the same command is
```
[username@da0]~% zcat /da?_data/basemaps/gz/a2cFull.V3.?.s | grep 'Warner Losh'
```
In other words, we look for the file on any of the servers, selecting an arbitrary version of the database.

### Exercise 3

a) Find all files modified by the author ID 'Warner Losh <imp@bsdimp.com>'.

Hint 1: What is the map name?

Author ID to File, or a2f:
```
[username@da0]~% echo 'Warner Losh <imp@bsdimp.com>' | ~/lookup/getValues a2f
```

Find all commits by developers who have your first and last name:

Hint 1: use wc (word count), e.g. (this example takes a long time to compute):

```
[username@da0]~% zcat /da0_data/basemaps/gz/a2cFull*.s | grep -i 'audris' | grep -i 'mockus' | wc -l
```

b) Find all files modified by all author IDs used by the developer 'Warner Losh <imp@bsdimp.com>'.

Hint 1: What is the map name?
A represents all author IDs, so we first get the group name:
```
echo 'Warner Losh <imp@bsdimp.com>' | ~/lookup/getValues a2A
Warner Losh <imp@bsdimp.com>;imp <imp@bsdimp.com>
```

and then use it to get all files via A2f:
```
[username@da0]~% echo 'imp <imp@bsdimp.com>' | ~/lookup/getValues A2f
```

### Summary for Activity 3

For any key provided on standard input, a list of values is printed:
```
~/lookup/getValues [-f] a2c|c2dat|b2ta|b2fa|c2b|b2f|c2f|p2c|c2p|c2P|P2c
```
The -f option replaces the single output line per input line with one line per value.
(For single-value maps, such as c2dat or b2fa, -f makes no sense, as it would simply print distinct fields on separate lines.)

Also, only the first column of the input is considered to be the key; other fields are passed through, e.g.,
```
[username@da0]~% echo 'Warner Losh <imp@bsdimp.com>;zz' | ~/lookup/getValues -f a2c | head -n 3
Warner Losh <imp@bsdimp.com>;zz;0000ce4417bd8d9a2d66a7a61393558d503f2805
Warner Losh <imp@bsdimp.com>;zz;000109ae96e7132d90440c8fa12cb7df95a806c6
Warner Losh <imp@bsdimp.com>;zz;0001246ed9e02765dfc9044a1804c3c614d25dde
```

## Activity 4: Exploring the State of a Repo at the Last Commit

Let's suppose we only care about the last version of the files in a project, e.g., the last version of the README. lb2f (last blob to file) provides this relationship:

```
[username@da0]~% zcat /da?_data/basemaps/gz/lb2fFullV0.s | grep -i readme | head -n 5
00000057bfb6f79bdfd129f113533f9ada77cbba;/README.md
000000ad43fb50661d0f8ba20035f8f8a62b28b1;/README.md
000000c4ca807de513cd601810522141ed8347bf;/Day-92 Collection/readMe
000001222e62cd97679e0ed087c74037bab8f848;/README.md
0000013e32eb5f7497750cf652cfd540f23abb3e;/README.md
```

To get the projects, we just need to join it with b2P:

```
[username@da0]~% zcat /da?_data/basemaps/gz/lb2fFullV0.s | grep -i readme | join -t\; -1 1 -2 1 - <( zcat /da?_data/basemaps/gz/b2PFullV0.s) | head -n 5
00000057bfb6f79bdfd129f113533f9ada77cbba;/README.md;yoooov_certinel
000000ad43fb50661d0f8ba20035f8f8a62b28b1;/README.md;LeeYoonSam_SampleNodeEjsBoard
000000c4ca807de513cd601810522141ed8347bf;/Day-92 Collection/readMe;22MCA10027_-100daysofcodeChallenge
000001222e62cd97679e0ed087c74037bab8f848;/README.md;magnusrygh_cluster
0000013e32eb5f7497750cf652cfd540f23abb3e;/README.md;sneakyGrit_hello-world
```

Each project also has its last commit in lc2Pdat:

```
[username@da0]~% zcat /da?_data/basemaps/gz/lc2PdatFullV1.s | head -n 1 | tr ";" "\n"
01000009d4c8d8f088e30519131e4e60cf61e969
Dushyant099_Tetris
1500262480
-0400
Dushyant Patel
214c30ce8162a624f1f2442ff7bed46d0fb7b4b1
9e46a5cd45ce0adf3afe24ce616f5be0315c72b2
```

In fact, lb2f is computed from lc2Pdat by taking the tree (column 6 of lc2Pdat) and obtaining all blobs in that tree and, recursively, in its subtrees.
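As a small illustration, here is a hedged Python sketch of unpacking one such lc2Pdat record into named fields. It assumes the seven-field layout shown above (commit, deforked project, timestamp, timezone, author, tree, parent), using the example record for concreteness:

```
# Unpack one lc2Pdat record (fields are separated by ";").
record = ("01000009d4c8d8f088e30519131e4e60cf61e969;Dushyant099_Tetris;"
          "1500262480;-0400;Dushyant Patel;"
          "214c30ce8162a624f1f2442ff7bed46d0fb7b4b1;"
          "9e46a5cd45ce0adf3afe24ce616f5be0315c72b2")

commit, project, ts, tz, author, tree, parent = record.split(";")
# The tree (column 6) is the root that is recursively traversed to
# enumerate the last blobs of the project, which is how lb2f is derived.
print(project, tree)
```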
## Activity 5: Using Python APIs from oscar.py

**oscar.py Tutorial:** oscar.py has its own tutorial for hackathon purposes. We suggest that you go [here](https://github.com/ssc-oscar/oscar.py/blob/master/docs/tutorial.md) and read through it. The tutorial contains information about the currently available functions, how to implement applications (simple and complex), and useful imports for applications.

**Important Note:** If you experience any difficulties in retrieving data from oscar.py's function calls (i.e., you receive an empty tuple on function return), please run `git pull` in your cloned repo to stay up to date with the latest version of oscar.py.

These are the corresponding functions in oscar.py that open the .tch files listed below for a given entity. "/" after a function name denotes the version of that function that returns a Generator object.

1. `Author('...')` - initialized with a combination of name and email
   * `.blobs`
   * `.commit_shas/commits`
   * `.project_names`
   * `.files`
   * `.torvald` - returns the torvald path of an Author, i.e., who did this Author work with that also worked with Linus Torvalds
2. `Blob('...')` - initialized with the SHA of a blob
   * `.author` - returns the timestamp, author name, and binary SHA of the commit
   * `.commit_shas/commits` - commits removing this blob are not included
   * `.data` - content of the blob
   * `.file_sha(filename)` - compute a blob SHA from file content
   * `.position` - get the offset and length of the blob data in storage
   * `.parent`
   * `.string_sha(string)`
   * `.tkns` - result of a ctags run on this blob, if there was one
3. `Commit('...')` - initialized with the SHA of a commit
   * `.blob_shas/blobs`
   * `.child_shas/children`
   * `.changed_file_names/files_changed`
   * `.parent_shas/parents`
   * `.project_names/projects`
   * `.attributes` - time, tz, author, tree, parent(s)
   * `.tdiff`
4. Deprecated, see [#50](https://github.com/ssc-oscar/oscar.py/issues/50): `File('...')` - initialized with a path, starting from a commit root tree
   * `.authors`
   * `.blobs`
   * `.commit_shas/commits`
5. `Project('...')` - initialized with a project name/URI
   * `.author_names`
   * `.commit_shas/commits`
6. `Tdiff('...')` - initialized with a SHA; the result of a diff run on 2 blobs (if there was a diff)
   * `.commit`
   * `.file`
7. `Tree('...')` - representation of a git tree object (dir), initialized with the SHA of a tree
   * `.files`
   * `.blob_shas/blobs`
   * `.commit_shas/commits`
   * `.traverse`

The non-Generator versions of these functions return a tuple of items, which can then be iterated:
```
for commit in Author(author_name).commit_shas:
    print(Commit(commit))
```

### Exercise 5a: Get a list of commits made by a specific author

Install the latest oscar.py:

```
[username@da0]~% cd ~/oscar.py
```

If `import oscar` fails:

```
[username@da0]~% easy_install --user clickhouse-driver
```

As we learned before, we can do this in the shell:

```
[username@da0]~% zcat /da?_data/basemaps/gz/a2cFull.V3.0.s | grep '"Albert Krawczyk" <pop-pop@live.com.au>' | head -n 3
"Albert Krawczyk" <pop-pop@live.com.au>;17abdbdc90195016442a6a8dd8e38dea825292ae
"Albert Krawczyk" <pop-pop@live.com.au>;2a98c68d153f1fd78cc356727263a2046abf887d
"Albert Krawczyk" <pop-pop@live.com.au>;3cdd0e1cefbec43a9c3d3138dd6734191529763a
```

Now the same thing can be done using oscar.py:

```
[username@da0]~% cd oscar.py
[username@da0:oscar.py]~% python3
>>> from oscar import Author, Commit
>>> for i, commit in enumerate(Author('"Albert Krawczyk" <pop-pop@live.com.au>').commit_shas):
...     if i >= 3:
...         break
...     print(Commit(commit))
...
17abdbdc90195016442a6a8dd8e38dea825292ae
2a98c68d153f1fd78cc356727263a2046abf887d
3cdd0e1cefbec43a9c3d3138dd6734191529763a
>>>
```
### Exercise 5b: Get the URL of a project's repository using the oscar.py `Project(...).url` attribute:
```
[username@da0:oscar.py]~% python3
>>> from oscar import Project
>>> Project('notcake_gcad').url
'https://github.com/notcake/gcad'
```

### Exercise 5c

Get the list of files modified by commit 17abdbdc90195016442a6a8dd8e38dea825292ae.

Hint 1: What class to use?
Commit:
```
[username@da0:oscar.py]~% python3
>>> from oscar import Commit
>>> Commit('17abdbdc90195016442a6a8dd8e38dea825292ae').changed_file_names
```

## Activity 6: Understanding Servers and Folders

All home folders are on da2, so it is preferable not to do very large file operations to/from these folders when running tasks on servers other than da2, since these operations will load NFS and may slow access to the home folders of other users.

Each server has a /data/play folder where you can create your own subfolders to store/process large files.

### List of relevant directories

Not all files are stored on all servers, due to limited disk sizes and the different speeds of the disks (fast refers to SSDs).
The location of a file can be identified via its pathname, as described below.

### da0/../da5 Servers

#### .{0-31}.tch files can be found in `/da[0-5]_fast/` or `/da[0-5]_data/basemaps`

(.s) signifies that there are either .s or .gz versions of these files in the /da[0-5]_data/basemaps/gz/ folder, which can be opened with the Python gzip module or Unix zcat. All da[0-5] servers may have these .s/.gz files.
Keys for identifying letters:

* a = Author
* b = Blob
* c = Commit
* cc = Child Commit
* f = File
* h = Head Commit
* ob = Parent Blob
* p = Project
* pc = Parent Commit
* P = Forked/Root Project (see the Note below)
* ta = Time;Author
* fa = First;Author;commit
* r = root commit obtained by traversing commit history
* h = head commit obtained by traversing commit history
* td = Tdiff
* tk = Tokens (ctags)
* trp = Torvalds Path

Version T keys for identifying letters:
* L = LICENSE* files
* Lb = blobs that are shared among fewer than 100 Projects
* fb = first blob
* tac = time, author, commit
* t = root tree

Recall that the capital version of author, A, means the aliased version (see https://arxiv.org/abs/2003.08349), and it also means that organizational and group IDs, bot IDs, as well as author IDs that do not contain sufficient info to alias, are excluded.
Similarly, the capital version of project, P, represents a deforked project (via Louvain community detection on the commit/repo bi-graph: https://arxiv.org/abs/2002.02707).

The list of relationships can be obtained via
```
echo $(ls /da?_data/basemaps/gz/*FullV0.s| sed 's|.*/||;s|FullV0.s||')
A2P A2c A2mnc P2A P2a P2c P2core P2g P2mnc P2tac a2P a2c a2p c2P c2acp c2cc c2dat c2p c2pc p2a
p2c A2b A2f A2fb A2tPc A2tPlPkg A2tspan P2b P2binf P2f P2fb P2nfb P2tAlPkg P2tspan Pkg2tPA Pt2Ptb
Ptb2Pt a2f a2fb b2P b2def b2fA b2f b2fa b2ob b2ptf b2tA b2tP b2ta b2tk b bb2cf c2PtAbflDef
c2PtAbflPkg c2PtabflDef c2PtabflPkg c2b c2f c2fbb lb2f lc2Pdat ob2b obb2cf t2all t2ptf tk2b
```

```
* a2b        * a2c (.s)    * a2f       * a2ft
* a2p (.s)   * a2trp0 (.s)
* b2a        * b2tac (.s)  * b2f (.s)  * b2ob   * ob2b
* b2tk
* c2b (.s)   * c2cc        * c2f (.s)  * c2h    * c2pc
* c2p (.s)   * c2P         * c2ta (.s) * c2td
* p2a (.s)   * p2c (.s)    * P2c
* td2c       * td2f
```

Special relationships (names do not correspond to keys):
```
Versions T or U:
b2f[aA] - blob to time, author, commit for the first commit creating that blob
b2tac - blob to time, author, commit for all commits creating that blob
bb2cf - result of a diff on a commit: blob, old blob, commit, file
obb2cf - see bb2cf, but with the blobs reversed
c2fbb - result of a diff on a commit: commit, file, blob, old blob

P2core - Project to devs who make 80+% of the commits

b2fLICENSE - grep for LICENSE in b2f
bL2P - license blob to project

c2dat - full commit data in semicolon-separated fields

dl2Pf - API defined; language; project; file
====
```

Note: c2P returns the most central repository for a commit and does not include repos that forked off of that commit.
P2c returns ALL commits associated with a repo, including commits made to forks of that particular repo.
The list of relationships is not exhaustive; more information can be found at https://github.com/woc-hack/tutorial/issues/17#issuecomment-850823408

### Exercise 6

Find all blobs associated with Julia language files (extension .jl).

Hint 1: What is the name of the map?

```
[username@da0]~% zcat /da?_data/basemaps/gz/b2fFullU*.s | grep '\.jl;'
```

## Activity 7: Investigating Technical Dependencies

The technical dependencies have been extracted by parsing the content of all blobs related to several different languages and, for version V, are located in `/da7_data/basemaps/gz/c2PtAbflPkgFullVX.s`, with X ranging from 0 to 127 based on the 7 bits in the first byte of the commit SHA1.

The format of each file is encoded in its name:
```
commit;deforked repo;timestamp;Aliased author;blob;filename;language (as used in WoC);module1;module2;...
```
for example
```
000000000fcd56c8536abd09cac5f2a54ba600c2;not-an-aardvark_lucky-commit;1510643197;Teddy Katz <teddy.katz@gmail.com>;d9730ab3fca05f4d61e7225de5731868cfb99fb6;lucky-commit.c;C;errno.h;string.h;math.h;zlib.h;stdio.h;sha.h;stdbool.h;stdlib.h;stat.h
```

Unlike in version R, where each language had a separate thruMaps directory, info on all languages is kept in a single place.
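As a hedged illustration, here is a small Python sketch that post-processes such records to collect the repositories importing a given module. The function name `repos_importing` is ours; the path pattern and the semicolon-separated layout are taken from the description above (adjust the version letter to whatever is current):

```
import glob
import gzip

def repos_importing(module, pattern="/da7_data/basemaps/gz/c2PtAbflPkgFullV*.s"):
    """Collect deforked repos whose commits import `module`.

    Assumes the layout described above:
    commit;repo;timestamp;author;blob;filename;language;module1;module2;...
    """
    repos = set()
    for path in glob.glob(pattern):
        with gzip.open(path, "rt", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split(";")
                if module in fields[7:]:  # modules start at field 8
                    repos.add(fields[1])
    return repos

# Example (slow: it may scan up to 128 large files):
# print(len(repos_importing("tensorflow")))
```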
To identify the implementations of various packages, one can use `/da?_data/basemaps/gz/c2PtAbflDefFullUX.s`, with X ranging from 0 to 127 based on the 7 bits in the first byte of the commit SHA1.
For example:
```
zcat /da?_data/basemaps/gz/c2PtAbflDefFullU0.s|head
0000000000abc668c5388237320e97d0dadae7b1;not-an-aardvark_lucky-commit;1613716402;Teddy Katz <teddy.katz@gmail.com>;050e87971a0a069043821c8d5f0c55d1f4761edc;Cargo.toml;Rust;lucky_commit
0000000000abc668c5388237320e97d0dadae7b1;not-an-aardvark_lucky-commit;1613716402;Teddy Katz <teddy.katz@gmail.com>;61aebdecc47b2b7521a353b1cc180b2af1080977;Cargo.lock;Rust;addr2line
```
Instead of the list of dependencies, the last field identifies the package implemented within the blob, specifically lucky_commit and addr2line in the above two blobs.

The Def relationship in WoC tracks blobs that define a package, based on the content of the source code. There is no guarantee that only one project will have it, due to copying and other reasons. Identifying which repository is the true upstream one may not be that difficult, however.

The Def relationship points only to blobs that define the package (e.g., blobs for the file setup.py in Python, package.json in JavaScript, etc.). This can be used to identify repositories (or parts of repositories) where these package metafiles reside.

*TODO*: put it into Clickhouse to speed up access.

Let's get a list of commits and repositories that imported TensorFlow in .py files:
```
[username@da0]~% zcat c2PtAbflPkgFullU76.s | grep tensorflow | head -2
000005efe300482514d70d44c5fa922b34ff79a5;Rayhane-mamah_Tacotron-2;1557284915;qq443452099 <47710489+qq443452099@users.noreply.github.com>;05604b3f0632e98cc0eee3afef589dc5031f3a43;tacotron/synthesizer.py;PY;tacotron.utils.text.text_to_sequence;tacotron.utils.plot;tacotron.models.create_model;wave;datasets.audio;os;librosa.effects;tensorflow;infolog.log;datetime.datetime;io;numpy
000005efe300482514d70d44c5fa922b34ff79a5;Rayhane-mamah_Tacotron-2;1557284915;qq443452099 <47710489+qq443452099@users.noreply.github.com>;49bc3b8b6533b93941223ccbeb401e47e5a573d7;hparams.py;PY;tensorflow;numpy
```

### Exercise 7

Find all repositories using the Julia language that import the package 'StaticArrays'.

Hint 1: What file to look for?
```
[username@da0]~% zcat /da?_data/basemaps/gz/c2PtabllfPkgFullS*.s | grep ';jl;' | grep StaticArrays
```

Hint 2: What field contains the repository name?
```
[username@da0]~% zcat /da?_data/basemaps/gz/c2PtabllfPkgFullS*.s | grep ';jl;' | grep StaticArrays | cut -d\; -f2 | sort -u
```

## Activity 8: Investigating Copy-Based Reuse

WoC's operationalization of copy-based supply chains is based on mapping blobs (versions of the source code) to all commits and projects where they have been created. For each blob, all commits are sorted by their timestamp, and the project containing the first commit is identified as the originator; all other projects are reusers of that blob. These files are located in `/da?_data/basemaps/gz/Ptb2PtFullVX.s`, with X ranging from 0 to 127 based on the 7 bits in the first byte of the blob SHA1.

![Reuse Identification Data Flow Diagram](/Assets/reuse_DFD.png)
The format of each file is encoded in its name:
```
originating repo;timestamp;blob;destination repo;timestamp
```
for example
```
zhunengfei_ExtJS6.2-samples;1466402956;00000056a59bde3926f65c334caef688ccad0a08;bitbucket.org_mastercad_sencha_demo;1551632725
```
This means that blob 00000056a59bde3926f65c334caef688ccad0a08 was first seen in zhunengfei_ExtJS6.2-samples at 1466402956 and was reused by bitbucket.org_mastercad_sencha_demo at 1551632725.
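A minimal Python sketch of reading such records, assuming only the five-field format above (the concrete da0 path is illustrative; use whichever parts exist on your server):

```
import gzip
from collections import Counter

# Count how often each repo acts as the originator of a reused blob.
# Record layout: originating repo;timestamp;blob;destination repo;timestamp
originators = Counter()
with gzip.open("/da0_data/basemaps/gz/Ptb2PtFullV0.s", "rt", errors="replace") as f:
    for line in f:
        origin, otime, blob, dest, dtime = line.rstrip("\n").split(";")
        originators[origin] += 1

print(originators.most_common(5))
```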
## Activity 9: OSS License Identification

The proliferation of OSS has created a complex landscape of licensing practices, making accurate license identification essential for legal and compliance purposes.
WoC uses a comprehensive approach, scanning all blobs with "license" in their filepath and applying the winnowing algorithm for reliable text matching against known licenses.

This method successfully identifies and matches over 5.5 million unique license blobs across projects, generating a detailed project-to-license map.

This map is stored at `/da?_data/basemaps/gz/P2LtFullV.s`.

The file format is encoded as follows:
```
deforkedProject;License;time
```
The "time" field is in the "YYYY-MM" format and represents the commit timestamp when the license blob was committed to the project. This field may also have an "invalid" value, indicating that the commit timestamp was not valid (e.g., a future time due to discrepancies in the user's system time).

Additionally, since these timestamps only represent when the license was committed to the project and do not indicate whether the license is still present, the latest commit tree (before the WoC version V data collection date, 2023-05) of each project was examined. If the license blob was found in the latest commit, a record was added with the time set to "latest".

When interpreting the data, it's important to note that the scope of license detection does not include code files or references to licenses within project documentation.

## Activity 10: Suggested by the audience

Find all projects that have commits mentioning "sql injection".

The list of commits is in /da4_data/All.blobs/.
Let's log in to da4, create a folder to store temporary data on the same server ("/data/play/username"), and use the commit-to-project map to get the list of projects.

```
[username@da0]~% ssh da4
[username@da4]~% mkdir /data/play/audris
[username@da4]~% cd /data/play/audris
[username@da4:/data/play/audris]~% cut -d\; -f4 commit_*.idx | ~/lookup/showCmt.perl 2 | grep -i 'sql injection' > sql_inject
[username@da4:/data/play/audris]~% cut -d\; -f1 sql_inject | ~/lookup/getValues.perl /da0_data/basemaps/c2pFullP > sql_inject.c2p
```

## Activity 11: Summary of the activities undertaken

* Shell API (faster) and Python API (a Perl API is also available but not illustrated) for random access

* Sorted compressed tables for sweeps (grep)

* Key-value maps to link authors, commits, files, projects, and blobs

* Overview of naming conventions for servers/data/databases

* MongoDB tables with summary information about authors and projects, to enable selection of subsets for later analysis (e.g., I want authors with at least 100 commits who worked no less than three years and participated in at least five Java projects.)

### Summary Activity 11

* What type of usability improvements are needed?

* What types of tasks would you like to work on during the hackathon?

* What would make you a long-time user of WoC?


## Self-paced part of the tutorial

The remaining activities are provided to illustrate various realistic tasks.

## Activity S0: Finding 1st-time imports for AI modules (Simple)

Given the data available, this is a fairly simple task. Making an application to detect the first time a repo adopted an AI module would give you a better idea as to when it was first used, and also when it started to gain popularity.

A good example of this is in [popmods.py](https://github.com/ssc-oscar/aiframeworks/blob/master/popmods.py). In this application, we can read all 128 c2PtabllfPkgFullS*.s files and look for a given module with the earliest import times. The program then creates a .first file, with each line formatted as `repo_name;UNIX_timestamp`.

TODO: update popmods.py to work with c2PtabllfPkgFullS*.s
Usage: `[username@da0]~% python popmods.py language_file_extension module_name`

Before anything else (and this can be applied to many other programs), you want to know what your input looks like ahead of time and how you are going to parse it.
Each line of the file has this format:
```
commit;deforked repo;timestamp;author;blob;language (as used in WoC);language (as determined by ctags);filename;module1;module2;...
```

We can use the `string.split()` method to turn this string into a list of words, split by a semicolon (;). By turning this line into a list and giving it a variable name, `entry = ['commit', 'repo_name', 'timestamp', ...]`, we can then grab the pieces of information we need with `repo, time = entry[1], entry[2]`.

An important idea to keep in mind is that we only want to count unique timestamps once. This is because we want to account for repositories that forked off of another repository with the exact timestamp of imports. An easy way to do this is to keep a running list of the times we have come across, and if we have already seen a timestamp before, we simply skip that line in the file:
```
...
if time in times:
    continue
else:
    times.append(time)
...
```
We also want to find the earliest timestamp for a repository importing a given module. Again, this is fairly simple:
```
...
if repo not in dict.keys() or time < dict[repo]:
    for word in entry[5:]:
        if module in word:
            dict[repo] = time
            break
...
```
#### Implementing the application

Now that we have the .first files put together, we can take this one step further and graph a module's first-time usage over time on a line graph, or even compare multiple modules to see how they stack up against each other. [modtrends.py](https://github.com/ssc-oscar/aiframeworks/blob/master/modtrends.py) accomplishes this (a simplified sketch follows the list) by:

* reading 1 or more .first files
* converting each timestamp for each repository into a datetime date
* "rounding" those dates by year and month
* putting those dates in a dictionary with `dict["year-month"] += 1`
* graphing the dates and frequencies using matplotlib
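Here is a minimal, hedged sketch of that month-by-month counting (the `.first` file name is just an example; matplotlib is assumed to be available):

```
from collections import Counter
from datetime import datetime, timezone

import matplotlib.pyplot as plt

# Count first-time imports per "year-month", as modtrends.py does.
counts = Counter()
with open("tensorflow.first") as f:           # lines: repo_name;UNIX_timestamp
    for line in f:
        repo, ts = line.rstrip("\n").split(";")
        date = datetime.fromtimestamp(float(ts), tz=timezone.utc)
        counts["%d-%02d" % (date.year, date.month)] += 1

months = sorted(counts)
plt.plot(months, [counts[m] for m in months])
plt.xlabel("year-month")
plt.ylabel("repos importing for the first time")
plt.show()
```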
If you want to compare first-time usage over time for Tensorflow and Keras for the .ipynb language .first files you created, run:
```
UNIX> python3.6 modtrends.py tensorflow.first keras.first
```
The final graph looks something like this:
[![Tensorflow vs Keras](../ipynb_first/Tensorflow-vs-Keras.png "Tensorflow vs Keras")](https://github.com/ssc-oscar/aiframeworks/blob/master/charts/ipynb_charts/Tensorflow-vs-Keras.png)


## Activity S1: Detecting percentage language use and changes over time (Complex)

An application to calculate this would be useful for seeing how different authors changed languages over a range of years, based on the commits they have made to different files.
To accomplish this task, we will modify an existing program from the swsc/lookup repo ([a2fBinSorted.perl](https://bitbucket.org/swsc/lookup/src/master/a2fBinSorted.perl)) and create a new program ([a2L.py](https://bitbucket.org/swsc/lookup/src/master/a2L.py)) that will get language counts per year per author.

#### Part 1 -- Modifying a2fBinSorted.perl

For the first part, we look at what a2fBinSorted.perl currently does: it takes one of the 32 a2cFullP{0-31}.s files through STDIN, opens the 32 c2fFullO.{0-31}.tch files for reading, and writes a corresponding a2fFullP.{0-31}.tch file based on the a2c file number, the lines of the file being `author_id;file1;file2;file3...`

Example usage: `UNIX> zcat /da0_data/basemaps/gz/a2cFullP0.s | ./a2fBinSorted.perl 0`

We can modify this program so that it also writes the earliest commit dates made by that author for those files, which will become useful for a2L.py later on. To accomplish this, we have the program additionally read from the c2taFullP.{0-31}.tch files, so we can get the time of each commit made by a given author:
```
my %c2ta;
for my $s (0..($sections-1)){
    tie %{$c2ta{$s}}, "TokyoCabinet::HDB", "/fast/c2taFullP.$s.tch", TokyoCabinet::HDB::OREADER |
        TokyoCabinet::HDB::ONOLCK,
        16777213, -1, -1, TokyoCabinet::TDB::TLARGE, 100000
        or die "cant open fast/c2taFullP.$s.tch\n";
}
```
We will also ensure that the files to be written have the relationship a2ft, as opposed to a2f:
```
my %a2ft;
tie %a2ft, "TokyoCabinet::HDB", "/data/play/dkennard/a2ftFullP.$part.tch", TokyoCabinet::HDB::OWRITER |
    TokyoCabinet::HDB::OCREAT,
    16777213, -1, -1, TokyoCabinet::TDB::TLARGE, 100000
    or die "cant open /data/play/dkennard/a2ftFullP.$part.tch\n";
```
Another important part of the file we want to change is inside the `output` function:
```
sub output {
    my $a = $_[0];
    my %fs;
    for my $c (@cs){
        my $sec = segB ($c, $sections);
        if (defined $c2f{$sec}{$c} and defined $c2ta{$sec}{$c}){
            my @fs = split(/\;/, safeDecomp ($c2f{$sec}{$c}, $a), -1);
            my ($time, $au) = split(/\;/, $c2ta{$sec}{$c}, -1); # add this for grabbing the time
            for my $f (@fs){
                if (defined $time and (!defined $fs{$f} or $time < $fs{$f})){ # modify condition to grab earliest time
                    $fs{$f} = $time;
                }
            }
        }
    }
    $a2ft{$a} = safeComp (join ';', %fs); # changed
}
```
Now when we run the new program, it should write individual a2ftFullP.{0-31}.tch files with the format:
`author_id;file1;file1_timestamp;file2;file2_timestamp;...`

We can then create a new PATHS
dictionary entry in oscar.py, as well as another function under the Author class, to read our newly created .tch files:
```
In the PATHS dictionary:
...
'author_file_times': ('/data/play/dkennard/a2ftFullP.{key}.tch', 5)
...

In class Author(_Base):
...
@cached_property
def file_times(self):
    data = decomp(self.read_tch('author_file_times'))
    return tuple(file for file in (data and data.split(";")))
...
```

#### Part 2 -- Creating a2L.py

Our next task involves creating a2LFullP{0-31}.s files utilizing the new .tch files we have just created. We want each line of these files to hold the author name, each year, and the language counts for that year. A possible format could look something like this:
`"tim.bentley@gmail.com" <>;year2015;2;py;31;js;30;year2016;1;py;29;year2017;8;c;2;doc;1;py;386;html;6;sh;1;js;3;other;3;build;1`
where the number after each year represents the number of languages used in that year, followed by pairs of languages and the number of files written in that language for that year. As an example, in the year 2015, Tim Bentley made initial commits to files in 2 languages, 31 of which were in Python and 30 of which were in JavaScript.

There are a number of things that have to happen to get to this point, so let's break it down:

* Iterating Author().file_times and grouping timestamps into years

We start by reading in an a2cFullP{0-31}.s file to get a list of authors, which we then hold as a tuple in memory, and start building our dictionary:
```
a2L[author] = {}
file_times = Author(author).file_times
for j in range(0,len(file_times),2):
    try:
        year = str(datetime.fromtimestamp(float(file_times[j+1]))).split(" ")[0].split("-")[0]
    # have to skip years either in the 20th century or somewhere far in the future
    except ValueError:
        continue
    # in case the last file listed doesn't have a time
    except IndexError:
        break
    year = specifier + year  # specifier is the string 'year'
    if year not in a2L[author]:
        a2L[author][year] = []
    a2L[author][year].append(file_times[j])
```
The datetime.fromtimestamp() function turns the timestamp into a datetime format, `year-month-day hour-min-sec`, which we split by a space to get the first half (`year-month-day`) of the string, and then split again to get `year`.

* Detecting the language of a file based on its extension
```
for year, files in a2L[author].items():
    build_list = []
    for file in files:
        la = "other"
        if re.search("\.(js|iced|liticed|iced.md|coffee|litcoffee|coffee.md|ts|cs|ls|es6|es|jsx|sjs|co|eg|json|json.ls|json5)$",file):
            la = "js"
        elif re.search("\.(py|py3|pyx|pyo|pyw|pyc|whl|ipynb)$",file):
            la = "py"
        elif re.search("(\.[Cch]|\.cpp|\.hh|\.cc|\.hpp|\.cxx)$",file):
            la = "c"
        .......
```
The simplest way to check for a language based on a file extension is to use the re module for regular expressions. If a given file matches a certain expression, like `.py`, then that file was written in Python; `la = "other"` if no match was found in any of those searches. We then keep track of these languages, put each language in a list (`build_list.append(la)`), and count how many times each language occurred when we looped through the files (`build_list.count(lang)`).
The final format for an author in the a2L dictionary will be `a2L[author][year][lang] = lang_count`.

* Writing each author's information into the file

See [a2L.py](https://bitbucket.org/swsc/lookup/src/master/a2L.py) for how the information is written into each file.

Usage: `UNIX> python a2L.py 2` for writing `a2LFullP2.s`

#### Implementing the application

Now that we have our a2L files, we can run some interesting statistics on how significantly language usage changes over time for different authors. The program [langtrend.py](https://bitbucket.org/swsc/lookup/src/master/langtrend.py) runs the chi-squared contingency test (via the stats.chi2_contingency() function from the scipy module) for authors from an a2LFullP{0-31}.s file and calculates a p-value for each pair of years for each language for each author.
This p-value means the percentage chance that you would find another person (say, out of 1000 people) who has this same extreme of change in language use, whether that be an increase or a decrease. For example, if a given author edited 300 different Python files in 2006 but then edited 500 different Java files in 2007, the percentage chance that you would see this extreme of a change in another author is very low. In fact, if this p-value is less than 0.001, the change in language use between a pair of years is considered "significant".

For this p-value to be a more accurate approximation, we need a larger sample size of language counts. When reading the a2LFullP{0-31}.s files, you may want to rule out people who don't meet certain criteria:

* the author has at least 5 consecutive years of commits for files
* the author has edited at least 100 different files across all of their years of commits

If an author does not meet these criteria, we would not want to consider them for the chi-squared test, simply because their results would be "uninteresting" and not worth investigating further.
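To make the test concrete, here is a small hedged sketch of one year-pair comparison; the counts are made up for illustration, and langtrend.py's actual bookkeeping differs:

```
from scipy import stats

# Contingency table for one language across two adjacent years:
# rows = years, columns = (files in this language, files in other languages)
table = [[300, 40],   # hypothetical 2006: 300 Python files out of 340
         [  5, 495]]  # hypothetical 2007: 5 Python files out of 500

chi2, pvalue, dof, expected = stats.chi2_contingency(table)
if pvalue < 0.001:
    print("significant rise/drop, p =", pvalue)
else:
    print("no change, p =", pvalue)
```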
Here is one of the authors from the program's output:
```
----------------------------------
Ben Niemann
{ '2015': {'doc': 3, 'markup': 2, 'obj': 1, 'other': 67, 'py': 127, 'sh': 1},
  '2016': {'doc': 1, 'other': 23, 'py': 163},
  '2017': {'build': 36, 'c': 116, 'lsp': 1, 'other': 81, 'py': 160},
  '2018': { 'build': 12,
            'c': 134,
            'lsp': 2,
            'markup': 2,
            'other': 133,
            'py': 182},
  '2019': { 'build': 13,
            'c': 30,
            'doc': 8,
            'html': 10,
            'js': 1,
            'lsp': 2,
            'markup': 16,
            'other': 67,
            'py': 134}}
pfactors for obj language
2015--2016 pfactor == 0.9711606775110577 no change
pfactors for doc language
2015--2016 pfactor == 0.6669499228133753 no change
2016--2017 pfactor == 0.7027338745275937 no change
2018--2019 pfactor == 0.0009971248193242038 rise/drop
pfactors for markup language
2015--2016 pfactor == 0.5104066960256399 no change
2017--2018 pfactor == 0.5532258789014389 no change
2018--2019 pfactor == 1.756929555308731e-05 rise/drop
pfactors for py language
2015--2016 pfactor == 1.0629725495084215e-07 rise/drop
2016--2017 pfactor == 1.2847558344252341e-25 rise/drop
2017--2018 pfactor == 0.7125543569718793 no change
2018--2019 pfactor == 0.026914075872778477 no change
pfactors for sh language
2015--2016 pfactor == 0.9711606775110577 no change
pfactors for other language
2015--2016 pfactor == 1.7143130378377696e-06 rise/drop
2016--2017 pfactor == 0.020874234589765908 no change
2017--2018 pfactor == 0.008365948846657284 no change
2018--2019 pfactor == 0.1813919210757513 no change
pfactors for c language
2016--2017 pfactor == 2.770649054044977e-16 rise/drop
2017--2018 pfactor == 0.9002187643203734 no change
2018--2019 pfactor == 1.1559110387953382e-08 rise/drop
pfactors for lsp language
2016--2017 pfactor == 0.7027338745275937 no change
2017--2018 pfactor == 0.8855759560371912 no change
2018--2019 pfactor == 0.9944669523033288 no change
pfactors for build language
2016--2017 pfactor == 4.431916568235125e-05 rise/drop
2017--2018 pfactor == 5.8273175348446296e-05 rise/drop
2018--2019 pfactor == 0.1955154860787908 no change
pfactors for html language
2018--2019 pfactor == 0.0001652525618661536 rise/drop
pfactors for js language
2018--2019 pfactor == 0.7989681687355706 no change
----------------------------------
```

Although it is currently not implemented, one could take this one step further and visually represent an author's language changes on a graph, which would be simpler to interpret than a long list of p-values such as the one shown above.
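For instance, here is a minimal hedged sketch of such a visualization, using the 2015-2019 'py' and 'c' counts from the output above (matplotlib is assumed; zeros stand in for years with no files in a language):

```
import matplotlib.pyplot as plt

# Per-year file counts for one author, taken from the output above.
years = ["2015", "2016", "2017", "2018", "2019"]
py_counts = [127, 163, 160, 182, 134]
c_counts = [0, 0, 116, 134, 30]

plt.plot(years, py_counts, marker="o", label="py")
plt.plot(years, c_counts, marker="o", label="c")
plt.ylabel("files first edited")
plt.legend()
plt.show()
```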
Rather than reading one of these files in its entirety, we look only for the lines of the file that contain the specific module we are looking for.
```
import subprocess

for i in range(32):
    print("Reading gz file number " + str(i))
    # Decompress one section of the map and filter it for the module of interest
    command = "zcat /data/play/" + dir_lang + "thruMaps/c2bPtaPkgO" + dir_lang + "." + str(i) + ".gz"
    p1 = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
    p2 = subprocess.Popen("egrep -w " + module, shell=True, stdin=p1.stdout, stdout=subprocess.PIPE)
    output = p2.communicate()[0]
```
We can then iterate over the lines of this output accordingly and gather the pieces of information we need:
```
for entry in str(output).rstrip('\n').split("\n"):
    entry = str(entry).split(";")
    repo, time = entry[1], entry[2]
```

Additional documentation on subprocess can be found [here](https://docs.python.org/2/library/subprocess.html).

----------
### re
The re (Regular Expression) module is another useful import for pattern matching in strings.
Additional re documentation can be found [here](https://docs.python.org/2/library/re.html).

----------
### matplotlib
A useful graphing module for creating visual representations.
Extensive documentation can be found [here](https://matplotlib.org/).

## Activity S3: a comparison of oscar.py vs. Perl scripts

When it comes to creating new relationship files (.tch/.s files), using Perl rather than Python for reading large amounts of data saves considerable time overall. This situation occurred in the complex application we covered, where we modified an existing Perl file to get the initial commit times of each file for each author, rather than using Python to accomplish this task.
Before making this decision, one of our team members ran a test between 2 programs, [a2ft.py](https://bitbucket.org/swsc/lookup/src/master/a2ft.py) and [a2ft.perl](https://bitbucket.org/swsc/lookup/src/master/a2ft.perl). These programs were run at the same time for a period of 10 minutes. Both programs had the same task of retrieving the earliest commit times for each file under each author from a2cFullP{0-31}.s files. The Python version calls the `Commit_info().time_author` and `Commit().changed_file_names` functions from oscar.py. The Perl version ties each of the 32 c2fFullO.{0-31}.tch (Commit().changed_file_names) and c2taFullP.{0-31}.tch (Commit_info().time_author) files into 2 different Perl hashes (the Perl equivalent of a Python dictionary), %c2f and %c2ta. The speed difference between Perl and Python was quite surprising:

```
[username@da3]/data/play/dkennard% ll a2ftFullP0TEST1.s
-rw-rw-r--. 1 dkennard dkennard 980606 Jul 22 11:56 a2ftFullP0TEST1.s
[username@da3]/data/play/dkennard% ll a2ftFullPTEST2.0.tch
-rw-r--r--. 1 dkennard dkennard 663563424 Jul 22 11:56 a2ftFullPTEST2.0.tch
```

Within this 10-minute period, the Python version only wrote 980,606 bytes of data into the TEST1 file shown above, whereas the Perl version wrote 663,563,424 bytes into the TEST2 file.
The main reason oscar.py is slower, in theory, is that it has to perform more private function calls in order to calculate the key (0-31) and locate where the requested information is stored.
Upon further inspection of the [oscar.py](https://github.com/ssc-oscar/oscar.py/blob/master/oscar.py) functions that are called, we can see that there are six to seven function calls for each lookup. All of these calls add overhead and thus increase the amount of time needed to retrieve data for multiple entities.
In the Perl version of a2ft, the program simply calls `segB()`, which calculates the key of where the information is stored. The function takes a string and the number 32 as arguments (e.g., segB($commit_sha, 32)):

```
sub segB {
  # use the first byte of the sha1 to pick one of $n sections
  my ($s, $n) = @_;
  return (unpack "C", substr ($s, 0, 1))%$n;
}
```

Because the %c2f and %c2ta Perl hashes are tied to their respective .tch files, we can then check whether a specific commit in a specific numbered section is defined:

```
for my $c (@cs){ # where cs is a list of commits for an author and c is one of those commits
  my $sec = segB ($c, $sections);
  if (defined $c2f{$sec}{$c} and defined $c2ta{$sec}{$c}){
    ...
  }
  ...
}
```

This is not to say that oscar.py is inefficient and should not be utilized, but it is not the optimal solution for creating new .tch or .s relationship files. oscar.py provides a Python interface for gathering requested data out of the respective .tch files, not for mass-reading all 32 files. It also provides the simple function calls mentioned earlier in the tutorial for conveniently retrieving small pieces of information at a time.


## Activity S4: Plumbing of WoC

We can obtain a diff for any commit. Doing so requires comparing the commit's tree with that of its parent.

Let's find the diff for 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a:
```
[username@da0]~% echo 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a | ~/lookup/cmputeDiff
009d7b6da9c4419fe96ffd1fffb2ee61fa61532a;/sys/dev/pccbb/pccbb_isa.c;9d5818e25865797b96e4783b00b45f800423e527;594dc8cb2ce725658377bf09aa0f127183b89f77
009d7b6da9c4419fe96ffd1fffb2ee61fa61532a;/sys/dev/pccbb/pccbb_pci.c;b3c1363c90de7823ec87004fe084f41d0f224c9b;4155935a98ba3b5d3786fa1b6d3d5aa52c6de90a
```
We can see that it modified two files.

### Exercise

Calculate the change made to /sys/dev/pccbb/pccbb_isa.c by commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a.

Hint 1: Get the old and new blob for /sys/dev/pccbb/pccbb_isa.c

Hint 2: Use the shell output redirection operator '>' to save the content of each blob

Hint 3: Use unix diff to calculate the difference
```
[username@da0]~% diff old new
```

## Iterating over a dataset

Sometimes, iterating over the entire dataset using the already-created basemaps is the only way to retrieve the desired information. The basemappings from one data type to another are key-value pairings of data. As such, retrieving the entire dataset can usually be done in one pass over one of the already-created basemaps.

For example, if the goal were to determine information pertaining to each author in WoC, simply iterating over one of the many basemaps from author to some other dataset (e.g., a2b, a2c) will serve. Since these datasets are a key-value mapping from author to another dataset, this guarantees that each of the keys will be one of the unique authors in WoC.
From there, the desired information about that specific author can be determined.

Below is a Perl script template that allows for retrieval of all the authors from a2c.

-----------------------
```perl
#!/usr/bin/perl -I /home/audris/lookup -I /home/audris/lib64/perl5 -I /home/audris/lib/x86_64-linux-gnu/perl -I /usr/local/lib64/perl5 -I /usr/local/share/perl5
use strict;
use warnings;
use Error qw(:try);

use TokyoCabinet;

my $split = 32;
my %a2cF = ();
# tie each of the 32 sections of the a2c basemap to a hash
for my $sec (0..($split-1)){
    my $fname = "/da1_data/basemaps/a2cFullS.$sec.tch";
    tie %{$a2cF{$sec}}, "TokyoCabinet::HDB", "$fname",
        TokyoCabinet::HDB::OREADER | TokyoCabinet::HDB::ONOLCK,
        16777213, -1, -1, TokyoCabinet::HDB::TLARGE, 100000
        or die "cant open $fname\n";
}
# each key in a section is a unique author
while (my ($sec, $authors) = each %a2cF) {
    for my $a (sort keys %{$authors}) {
        print "$a\n";
    }
}
```
---------------
This script simply prints each WoC author's name. It helps illustrate how to go about retrieving the keys in a key-value basemap using Perl, but lacks any practical use on its own.

Notice that in this script $split is defined to be 32 and the for loop iterates from 0 to 31. The reason for this is how the data is stored in the basemaps. Each basemap from one data type to another is split into 32 roughly equal parts based on their hashes. As such, in order to iterate over the entire dataset, it is necessary to look at each of these files separately.

From there, Perl allows for direct tying to each of these files in the form of a hash. Because the basemappings are saved using TokyoCabinet, they must be opened using TokyoCabinet to retrieve the data.

Once the hash is tied to the mapping, it can be iterated over, and retrieving the information simply becomes accessing the elements.

## Mongo Database

On the da1 server, there is a MongoDB server holding some relevant data. This includes information that was used for data analysis in the past. Mongo provides an excellent place to store relatively small data that does not require relational information.

Two collections in the WoC database can be helpful for sampling projects and authors: A_metadata.V and P_metadata.V, where V represents the version (e.g., T), A stands for the aliased author ID, and P for the deforked repository name.

### MongoDB Access

When on the da1 server, you can gain access to the MongoDB server simply by running the command 'mongo'; when on any other da server, you can gain access by running 'mongo --host "da1.eecs.utk.edu"'.

Once on the server, you can see all the available databases using the "show dbs" command. However, the database that pertains primarily to WoC is the WoC database.

Most databases are used for teaching and other tasks, so please use the WoC database via the 'use <database name>' command, e.g., use WoC. After switching, you can view the available collections in the database with the 'show collections' command.
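These same discovery steps can also be scripted. Here is a minimal sketch using PyMongo (covered in more detail below) that mirrors "show dbs" and "show collections"; the host name is the one given above, and the rest is standard PyMongo:

```python
import pymongo

# Connect to the MongoDB server on da1 (host as described above)
client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")

print(client.list_database_names())           # scripted equivalent of "show dbs"
print(client["WoC"].list_collection_names())  # scripted equivalent of "show collections"
```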

Currently, there is an author metadata collection (A_metadata.U) that contains basic stats: the total number of projects, the total number of blobs created by the author (before anyone else), the total number of commits made, the total number of files they have modified, the distribution of language files modified by that author, and the first and last time the author committed (as Unix timestamps), based on the data contained in version U of WoC. Author names have been aliased, and the number of aliases and their list are also included in the record. Furthermore, up to 100 of the most commonly used APIs (packages) in the files modified by the author are also included.

Alongside this, there is a similar collection for projects on WoC (P_metadata.U) that contains the total number of authors on the project, the total number of commits, the total number of files, the distribution of languages used, and the first and last time there was a commit to the project (as Unix timestamps), based on version U of WoC. Since the project is deforked, the community size (the number of other projects that share commits with the deforked project) is also provided. The WoC relation P2p can be used to list the other projects linked to it. We also provide additional info based on linking to attributes that exist only on GitHub. That data is not as recent, however, and more work is needed to make it complete. These attributes include the number of stars GitHub users have given that project (if any), whether the project is a GitHub fork, and where it was forked from (if anywhere).

Finally, the collection of APIs (packages) contains a summary of the first and last time each package was used in a modified file, as well as the number of commits, authors, and repositories associated with the use of that package.

To see data in one of the collections, you can run the 'db."collection name".findOne()' command. This will show the first element in the collection and should help clarify what is in the collection.

When the above findOne() command is run on the A_metadata.U collection, the output is as follows (in this case we look only for items with more than 200 commits):

-----------
```
mongosh --host=da1
mongosh> use WoC
WoC> db.A_metadata.U.findOne({NumCommits:{$gt:200}})
{
  _id: ObjectId("62229b132bc6e5f0dbd0307f"),
  FileInfo: {
    Ruby: 2,
    TypeScript: 1,
    Python: 2,
    Rust: 92,
    PHP: 6895,
    other: 2406,
    Sql: 1,
    JavaScript: 1384,
    'C/C++': 6,
    Java: 1
  },
  NumActiveMon: 19,
  EarliestCommitDate: 1512136069,
  ApiInfo: {
    'Rust:the': 1,
    'PY:sys': 1,
    'PY:datetime': 1,
    'Rust:on': 1,
    'PY:os': 1
  },
  LatestCommitDate: 1550574037,
  MonNprj: {
    '2019-02': 1,
    '2017-11': 5,
    '2018-04': 2,
    '2018-12': 2,
    '2018-03': 2,
    '2019-05': 1,
    '2019-04': 1,
    '2019-03': 1,
    '2018-05': 2,
    '2018-02': 3,
    '2018-08': 1,
    '2018-01': 1,
    '2019-06': 1,
    '2017-12': 4,
    '2018-10': 1,
    '2018-06': 1,
    '2018-11': 1,
    '2018-07': 1,
    '2019-01': 4
  },
  NumOriginatingBlobs: 2187,
  AuthorID: 'Jennifer Calipel ',
  MonNcmt: {
    '2019-02': 9,
    '2017-11': 21,
    '2018-04': 13,
    '2018-12': 2,
    '2018-03': 29,
    '2019-05': 1,
    '2019-04': 2,
    '2019-03': 5,
    '2018-05': 9,
    '2018-02': 32,
    '2018-08': 42,
    '2018-01': 6,
    '2019-06': 2,
    '2017-12': 23,
    '2018-10': 1,
    '2018-06': 5,
    '2018-11': 1,
    '2018-07': 12,
    '2019-01': 17
  },
  NumCommits: 232,
  NumProjects: 18,
  NumFiles: 10790
}
```
Similarly for projects:
```
WoC> db.P_metadata.U.findOne({NumCommits:{$gt:200}})
{
  _id: ObjectId("62228cb7e65a0aefc2ca086f"),
  FileInfo: { other: 442, JavaScript: 17 },
  NumAuthors: 11,
  MonNauth: {
    '2020-04': 9,
    '2021-07': 1,
    '2020-11': 1,
    '2020-07': 1,
    '2021-05': 1,
    '2021-08': 1,
    '2020-08': 1,
    '2020-03': 5,
    '2020-06': 5,
    '2020-12': 1,
    '2020-05': 5
  },
  EarliestCommitDate: 1584055325,
  NumStars: 17,
  NumBlobs: 709,
  LatestCommitDate: 1628739252,
  ProjectID: 'foss-responders_fossresponders.com',
  MonNcmt: {
    '2020-04': 60,
    '2021-07': 2,
    '2020-11': 1,
    '2020-07': 1,
    '2021-05': 2,
    '2021-08': 2,
    '2020-08': 4,
    '2020-03': 48,
    '2020-06': 16,
    '2020-12': 3,
    '2020-05': 91
  },
  NumCore: 3,
  NumCommits: 230,
  CommunitySize: 1,
  NumFiles: 459,
  Core: {
    'SaptakS ': '23',
    'Awele ': '11',
    'Richard Littauer ': '157'
  },
  NumForks: 11
}
```
And for APIs:
```
WoC> db.API_metadata.U.findOne({$and: [ { NumCommits:{$gt:200} }, { NumProjects: {$gt:200} }, {NumAuthors:{$gt:200}} ]})
{
  _id: ObjectId("62257192758fdfbec79e9125"),
  NumAuthors: 456,
  Lang: 'C',
  NumProjects: 236,
  NumCommits: 4366,
  API: 'C:BasicUsageEnvironment.hh'
}
```
---------------

This metadata can then be parsed for the desired information.
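As a minimal illustration of such parsing, the sketch below picks out an author's dominant language from the FileInfo field; the `doc` literal is an abridged, invented stand-in for a record like the A_metadata.U example above:

```python
# `doc` stands in for a document retrieved via findOne() or PyMongo (see below)
doc = {
    "AuthorID": "Jennifer Calipel ",
    "FileInfo": {"PHP": 6895, "other": 2406, "JavaScript": 1384, "Rust": 92},
    "NumCommits": 232,
}

# Pick the language with the largest file count in the distribution
lang, nfiles = max(doc["FileInfo"].items(), key=lambda kv: kv[1])
print(f"{doc['AuthorID']} mostly edits {lang} ({nfiles} files)")
```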

Python, like most other programming languages, has an interface with Mongo that makes data storage and retrieval much simpler. When retrieving or inputting large amounts of data on the servers, it is almost always faster and easier to do so through one of the interfaces provided.


### PyMongo

PyMongo is a Python module that simplifies access to the database and the elements inside it. When accessing the server, you must first specify which Mongo client you wish to connect to. For our server, the host is "mongodb://da1.eecs.utk.edu/".
This will allow access to the data already saved and will allow for the creation of new data if desired.

From there, accessing databases inside the client becomes as simple as treating the desired database as an element inside the client. The same is true for accessing collections inside a database.
The code below illustrates this process.

--------
```python3
import pymongo
client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")

db = client["WoC"]
coll = db["A_metadata.U"]
```
-------

#### Data Retrieval using PyMongo

When attempting to retrieve data, iterating over the entire collection for specific info is often necessary. This is most often done through a Mongo-specific data structure called a cursor. However, cursors have a limited life span: after roughly 10 minutes of continuous connection to the server, the cursor is forcibly disconnected. This limits the possible number of idle cursors connected to the server at any time.

Taking this into consideration, if the process may take longer than that, it is necessary to create the cursor with no timeout. In that case, manually closing the cursor after it has served its purpose is required as well.
The code below illustrates creating such a cursor and iterating over the collection with it.

--------
```python
import pymongo

client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")

db = client["WoC"]
coll = db["A_metadata.U"]

dataset = coll.find({}, no_cursor_timeout=True)
for data in dataset:
    ...

dataset.close()
```
-------

Once data retrieval has begun, accessing the specific information desired is simple. For example, provided above is the information saved in one element of auth_metadata. If access to the AuthorID of each document is desired, "AuthorID" can be treated as the key in a key-value mapping. However, it is often necessary to consider how the data is stored.

Most often, data stored in Mongo is kept in a Mongo-specific format called BSON, whose strings are saved as Unicode. Working with Unicode can be an issue if printing needs to be done, so encoding must be handled. Below is a small program that prints each AuthorID from the auth_metadata collection.

----------
```python
import pymongo
import bson

client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")
db = client['WoC']
coll = db['A_metadata.U']

dataset = coll.find({}, no_cursor_timeout=True)
for data in dataset:
    a = data["AuthorID"].encode('utf-8').strip()
    print(a)

dataset.close()
```
----------

When retrieving data, it is often necessary to narrow the results. This is possible directly through Mongo when querying for information. For instance, if not all the data in auth_metadata is needed, just the NumCommits and the AuthorID, the query can be restricted by adding parameters to the find call. An example query is provided below.

----------
```python
dataset = coll.find({}, {"AuthorID": 1, "NumCommits": 1, "_id": 0}, no_cursor_timeout=True)

for data in dataset:
    print(data)

dataset.close()
```
---------

This specific call allows for direct printing of the data; however, as noted above, the names are saved in BSON and as such will be printed as Unicode strings. The first 10 results are shown below.

-------------
```
{u'NumCommits': 1, u'AuthorID': u' '}
{u'NumCommits': 0, u'AuthorID': u' <1151643598@163.com>'}
...
```
--------------

Sometimes, restricting the data even further is necessary. Notice above that many of the users have 0 commits. Exclusion of these entries may be desired. The example below illustrates a way to restrict the results to only users with more than 0 commits.

----------
```python
dataset = coll.find({"NumCommits": {"$gt": 0}},
                    {"AuthorID": 1, "NumCommits": 1, "_id": 0},
                    no_cursor_timeout=True)

for data in dataset:
    print(data)

dataset.close()
```
---------

## Accessing by time slices

To access collections indexed by time, we use the ClickHouse database:
https://clickhouse.yandex/docs/en/getting_started/tutorial/


It has interfaces to various languages, but its key features are super-fast indexing and the ability to distribute data over a cluster of da servers.

Since only commits have a time associated with them, we start by storing all commits in the database.
We store only a subset of the commits on each server: first we create a table local to each server and a table that represents all servers (the *_all tables):
```
v=u  # WoC version suffix
for h in da0 da1 da2 da3 da4
do echo "CREATE TABLE api_$v (api String, from Int32, to Int32, ncmt Int32, nauth Int32, nproj Int32) ENGINE = MergeTree() ORDER BY from" | clickhouse-client --host=$h
   echo "CREATE TABLE api_all AS api_$v ENGINE = Distributed(da, default, api_$v, rand())" | clickhouse-client --host=$h

   echo "CREATE TABLE commit_$v (sha1 FixedString(20), time Int32, tc Int32, tree FixedString(20), parent String, taz String, tcz String, author String, commiter String, project String, comment String) ENGINE = MergeTree() ORDER BY time" | clickhouse-client --host=$h
   echo "CREATE TABLE commit_all AS commit_$v ENGINE = Distributed(da, default, commit_$v, rand())" | clickhouse-client --host=$h
done
```

Then we import data into each of these five tables:
```bash
v=u
for j in {0..4}
do da=da$j
   for i in $(eval echo "{$j..31..5}")
   do echo "start inserting $da file $i"
      time zcat /da?_data/basemaps/gz/Pkg2stat$i.gz | ~/lookup/chImportPkg.perl | clickhouse-client --max_partitions_per_insert_block=1000 --host=$da --query 'INSERT INTO api_u FORMAT RowBinary'
   done
   for i in $(eval echo "{$j..127..5}")
   do echo "start inserting $da file $i"
      time zcat /da?_data/basemaps/gz/c2chFullU$i.s | ~/lookup/chImportCmt.perl | clickhouse-client --max_partitions_per_insert_block=1000 --host=$da --query 'INSERT INTO commit_u FORMAT RowBinary'
   done
done
```

Once the data is in, we can query commits:
```bash
clickhouse-client --host=da3 --query 'select count (*) from commits_all'
2061780191
```
or APIs:
```bash
echo "select api,ncmt, nauth, nproj from api_all where match(api, 'stdio') and nauth > 100 limit 3 FORMAT CSV" | clickhouse-client --host=da3 --format_csv_delimiter=";"
"C:stdio_ext.h";153898;4797;1671
"C:ustdio.h";9995;1107;230
"C:vcl_cstdio.h";5868;163;9
```


Queries are fast if we specify a specific time or an interval:
```bash
clickhouse-client --host=da3 --query 'select author,comment from commits_all where time=1568656268'
Matt Davis Made some SEO improvements and also added comments outlining what is contained in each section.\n
Jessie 1307 <295101171@qq.com> First Commit\n
�
�� <910063@gmail.com> 0917\n
nodemcu-custom-build Prepare my build.config for custom build
zzzz1313 Initial commit
Erik Faye-Lund .mailmap: add an alias for Sergii Romantsov\n
Paulus Pärssinen Initial commit
AnnaLub get all tickets command impl\n
```

We may want to match on the commit comment:
```bash
echo "select lower(hex(sha1)),author, project, comment from commit_all where match(comment, 'CVE-2021') limit 3 FORMAT CSV" | clickhouse-client --host=da3 --format_csv_delimiter=";"
"Florian Westphal ";"Jackeagle_kernel_msm-3.18";"netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore\nxt_replace_table relies on table replacement counter retrieval (which__NEWLINE__uses xt_recseq to synchronize pcpu counters).\nThis is fine, however with large rule set get_counters() can take__NEWLINE__a very long time -- it needs to synchronize all counters
because__NEWLINE__it has to assume concurrent modifications can occur.\nMake xt_replace_table synchronize by itself by waiting until all cpus__NEWLINE__had an even seqcount.\nThis allows a followup patch to copy the counters of the old ruleset__NEWLINE__without any synchonization after xt_replace_table has completed.\nCc: Dan Williams __NEWLINE__Reviewed-by: Eric Dumazet __NEWLINE__Signed-off-by: Florian Westphal __NEWLINE__Signed-off-by: Pablo Neira Ayuso \n(cherry picked from commit 80055dab5de0c8677bc148c4717ddfc753a9148e)__NEWLINE__Orabug: 32709122__NEWLINE__CVE: CVE-2021-29650__NEWLINE__Signed-off-by: Sherry Yang __NEWLINE__Reviewed-by: John Donnelly __NEWLINE__Signed-off-by: Somasundaram Krishnasamy "
"Joe Yu ";"daedroza_aosp_development_sony8960_n";"Fix storaged memory leak\nCVE-2021-0330 : (AOSP) EoP Vulnerability in Framework / storaged__NEWLINE__A-170732441__NEWLINE__Mot-CRs-fixed: (CR)\nstoraged try to load user's proto even if it has been loaded before\nhttps://partnerissuetracker.corp.google.com/u/2/issues/118719575\nChange-Id: Ia7575cdc60e82b028c6db9a29ae80e31e02268b1__NEWLINE__(cherry picked from commit 857a63eb6604baa1ed6b0e31839ccce8da18c716)__NEWLINE__Signed-off-by: Mark Salyzyn __NEWLINE__Bug: 170732441__NEWLINE__Test: compile__NEWLINE__(cherry picked from commit 8ec2afb91400818b0a8843b8917c05aba75b00db)__NEWLINE__Reviewed-on: https://gerrit.mot.com/1843719__NEWLINE__SLTApproved: Slta Waiver__NEWLINE__SME-Granted: SME Approvals Granted__NEWLINE__Tested-by: Jira Key__NEWLINE__Reviewed-by: Konstantin Makariev __NEWLINE__Submit-Approved: Jira Key"
"Joe Yu ";"daedroza_aosp_development_sony8960_n";"Fix storaged memory leak\nCVE-2021-0330 : (AOSP) EoP Vulnerability in Framework / storaged__NEWLINE__A-170732441__NEWLINE__Mot-CRs-fixed: (CR)\nstoraged try to load user's proto even if it has been loaded before\nhttps://partnerissuetracker.corp.google.com/u/2/issues/118719575\nChange-Id: Ia7575cdc60e82b028c6db9a29ae80e31e02268b1__NEWLINE__(cherry picked from commit 857a63eb6604baa1ed6b0e31839ccce8da18c716)__NEWLINE__Signed-off-by: Mark Salyzyn __NEWLINE__Bug: 170732441__NEWLINE__Test: compile__NEWLINE__(cherry picked from commit 8ec2afb91400818b0a8843b8917c05aba75b00db)__NEWLINE__Reviewed-on: https://gerrit.mot.com/1844255__NEWLINE__SLTApproved: Slta Waiver__NEWLINE__SME-Granted: SME Approvals Granted__NEWLINE__Tested-by: Jira Key__NEWLINE__Reviewed-by: Konstantin Makariev __NEWLINE__Submit-Approved: Jira Key"
```
Commit sha1s are stored in binary form, so to print them we need to post-process, e.g.:
```bash
clickhouse-client --host=da1 --query 'select sha1, author,comment from commits_all where time=1568656268 limit 1 format RowBinary' | perl -ane '$sha1=substr($_, 0, 20); $o=unpack "H*", $sha1; $rest=substr($_,21,length($_)-21); print "$o;$rest\n";'
fbb7add2a58b733a797d97a1e63cb8661702d0a3;zzzz1313 Initial commit
```
Alternatively, we can hex them in the select statement:
```bash
clickhouse-client --host=da1 --query "select lower(hex(sha1)),author,comment from commits_all where match(comment, '^(CVE-(1999|2\d{3})-(0\d{2}[0-9]|[1-9]\d{3,}))$') limit 2 format CSV"
"024fbd8de50c1269d178c3ee6b8664f5eee7f57b","nickmx1896 ","CVE-2016-2355"
"209446bab86e996d58c233abee0376cb26dcd4c4","jonathanliem94 ","CVE-2017-4963"
```

We can create additional tables so that time filtering is fast for other relations as well, for example for projects:

```
for j in {3..31..4}
do clickhouse-client --host=da3 --query "CREATE TABLE c2p_$j (date Date, sha1 FixedString(20), np UInt32, p String) ENGINE = MergeTree(date, sha1, 8192)"
   time ./importc2p.perl $j | clickhouse-client --max_partitions_per_insert_block=1000 --host=da3 --query "INSERT INTO c2p_$j (date, sha1, np, p) FORMAT RowBinary"
done
```



## Python Clickhouse API

The ClickHouse API is disabled in the current version of oscar; it is being turned into a separate module (see the draft in lookup/oscarch.py).


There are classes in oscarch.py that allow for querying the ClickHouse database:
1. `Time_commit_info(tb_name='commits_all', db_host='localhost')` - commits
   * `.commit_counts(start, end=None)` - get the count of the commits in a given time interval
   * `.commits_iter(start, end=None)` - get the commits as `Commit` objects in a generator
   * `.commits_shas(start, end=None)` - get the sha1s of the commits in a list
   * `.commits_shas_iter(start, end=None)` - get the sha1s of the commits in a generator
2. `Time_projects_info(tb_name='b2cPtaPkgR_all', db_host='localhost')` - \*projects
   * `.get_values_iter(cols, start, end)` - query columns given a time interval (generator)
   * `.project_timeline(cols, repo)` - query columns of a given project name, sorted by time (generator)
   * `.author_timeline(cols, author)` - query columns of a given author, sorted by time (generator)

\*note that the *b2cPtaPkgR_all* table currently does not contain projects that use the following programming languages: php, Lisp, Sql, Fml, Swift, Lua, Cob, Erlang, Clojure, Markdown, CSS

The structures of the tables are listed below:

**commits_all:**
| name | type |
|----------|-----------------|
| sha1 | FixedString(20) |
| time | Int32 |
| timeCmt | Int32 |
| tree | FixedString(20) |
| parent | String |
| TZAuth | String |
| TZCmt | String |
| author | String |
| commiter | String |
| project | String |
| comment | String |

**b2cPtaPkgR_all:**
| name | type |
|----------|-----------------|
| blob | FixedString(20) |
| commit | FixedString(20) |
| project | String |
| time | UInt32 |
| author | String |
| language | String |
| deps | String |

We can use `commit_counts` to query the count of the commits in a given time interval:
```python
>>> from oscarch import Time_commit_info
>>>
>>> t = Time_commit_info()
>>> t.commit_counts(1568656268)
9
```
The `commits_iter` method can be used to iterate through commit objects in a given time interval:
```python
>>> t = Time_commit_info()
>>> commits = t.commits_iter(1568656268)
>>> c = next(commits)
>>> c.parent_shas
('9c4cc4f6f8040ed98388c7dedeb683469f7210f5',)
```
The `commits_shas` method can be used to iterate through commit hashes in a given time interval:
```python
>>> t = Time_commit_info()
>>> for sha1 in t.commits_shas(1568656268):
...     print(sha1)
0a8b6216a42e84d7d1e56661f63e5205d4680854
874d92e732d79d0d8bafb1d1bcc76a3b6d81302f
ccf1a5847661de2df791a5a962e3499a478f48ab
39927c70a99f0949c1de3d65a2693c8768bc4e0f
fbb7add2a58b733a797d97a1e63cb8661702d0a3
...
```

The `b2cPtaPkgR_all` table stores the information associated with each **commit**.
For the `b2cPtaPkgR_all` table, use `get_values_iter` of the `Time_projects_info` class to query columns in a given time interval:
```python
>>> from oscarch import Time_projects_info as Proj
>>> p = Proj()
>>> rows = p.get_values_iter(['time','project'], 1568571909, 1568571910)
>>> for row in rows:
...     print(row)
...
(1568571909, 'mrtrevanderson_CECS_424')
(1568571909, 'gitlab.com_surajpatel_tic_toc_toe')
(1568571909, 'gitlab.com_surajpatel_tic_toc_toe')
...
```
`project_timeline` can be used to query for a specific repository. The result shows the time of each commit and the name of the repo, sorted by time:
```python
>>> rows = p.project_timeline(['time','project'], 'mrtrevanderson_CECS_424')
>>> for row in rows:
...     print(row)
...
(1568571909, 'mrtrevanderson_CECS_424')
(1568571909, 'mrtrevanderson_CECS_424')
(1568571909, 'mrtrevanderson_CECS_424')
...
```

It might be useful to examine the dependencies (i.e., includes in C or imports in Python) for each commit.
The snippet below shows the time, repo name, language, and dependencies for each commit. Note that the commits are sorted by time and the dependencies are separated by semicolons.
```python
>>> rows = p.get_values_iter(['time', 'project', 'language', 'deps'], 1568571915, 1568571916)
>>> for row in rows:
...     print(row)
...
(1568571916, 'Nakwendaa_neural-network', 'PY', 'numpy\n')
(1568571916, 'Nakwendaa_neural-network', 'PY', 'os;pickle;numpy;time;matplotlib.pyplot;gzip;Mlp.Mlp\n')
```
Similarly, `author_timeline` queries for a specific author:
```python
>>> rows = p.author_timeline(['time', 'project'], 'Andrew Gacek ')
>>> for row in rows:
...     print(row)
...
(49, 'smaccm_camera_demo')
(677, 'smaccm_vm_hack')
(1180017188, 'teyjus_teyjus')
...
```

# Considerations on performance

1. getValues and showCnt are not meant to be invoked once per key; instead, the keys are passed through standard input and one line of output is generated for each key (getValues also passes attributes through):
```
echo "k;a" | getValues k2v
```
produces one line
```
k;a;v0;v1;..vn
```
For blobs, it is possible to export the entire content as a single base64-encoded line.

2. Operations that require iteration over all keys or values (e.g., matching a pattern) are faster via flat files:
```
for i in {0..127}; do zcat /da?_data/basemaps/gz/k2vFullU$i.s; done | grep PATTERN
```
If the iteration is over commit content, use
```
cd /da5_data/All.blobs/
for i in {0..127}; do perl ~/lookup/lstCmt.perl 9 $i; done
```
If the iteration is over blob content:

```
cd /da5_data/All.blobs/
for i in {0..127}; do perl ~/lookup/lstBlob.perl $i; done
```
3. For a very large number of exact keys (over 500K) it is faster to use unix join: simply split the keys (via splitSec.perl for hashes and splitSecCh.perl for strings), sort each piece, and use unix join:
```
for i in {0..127}; do zcat /da?_data/basemaps/gz/k2vFullU$i.s | join -t\; <(zcat piece$i) -; done
```
--------------------------------------------------------------------------------
/ShellGuide.md:
--------------------------------------------------------------------------------
Hello and welcome to World of Code!

World of Code can be navigated using the Linux shell.
Let's go over some of the most commonly used Linux shell commands together!

As a quick note, this guide assumes you have used secure shell to connect to da0 by typing “ssh da0” on the appropriate command line, as previously
stated in this World of Code tutorial, and have done nothing else.

Once you are here you will be greeted by a prompt similar to this: “[wparham1@da0]~%”, except instead of wparham1 you will see the username
registered to you during World of Code signup!

Let's start learning how to navigate World of Code by using the shell. If you are new to using a Linux terminal, I encourage you to follow along or follow
this link "https://www.hostinger.com/tutorials/linux-commands" to further familiarize yourself with shell commands.

*Note: In this tutorial I will type all commands in quotes, but you should not include these quotes when operating with the shell unless specified otherwise.

Now that we are on the same page, let's begin by typing the command “ls” without the quotes and pressing enter.
• The “ls” command stands for “list” and it will list all contents of a directory.
  o The command can be further specified by using flags after the command. For instance, let's type the long flag, denoted by “ls -l”, next.
    This will give us more verbose output: the permissions on the file in the first column, the number of links to the file in
    the second column, the owner of the file in the third column, the group that owns the file in the fourth column,
    the size of the file in bytes in the fifth column, the date the file was last changed in the sixth, seventh, and eighth columns, and
    finally the file name in the ninth column.

  o Now, let's try typing the “all” flag, denoted by “ls -a”. This will show all contents of a directory, including files Linux would
    normally hide from the user such as “.” and “..”.

  o Now, let's combine both the “ls -l” flag and the “ls -a” flag (“ls -la”). This will result in the verbose, tabular listing of every file inside a directory.
    This is meant to demonstrate that you can use any valid combination of flags together with any given Linux command!

Now that you have closely examined a most likely empty directory, let's go ahead and talk about creating files and directories from the command line.
• The first and arguably most useful command when creating directories is “mkdir [directory name]”.
  Use this command for yourself by replacing the [directory name] with whatever you would like to name this directory.
  For this example, I recommend naming it “shell_tutorial”. That means the command would look like this: “mkdir shell_tutorial”.
  To confirm that you created the directory, use the “ls” command we talked about above.
Type “ls” into the terminal and press enter to see your new directory, good job!

• Another common way to create files is by using the Vim text editor. The Vim text editor, while extremely handy, can be a little tricky to get used to.
  First, type “vim [file name]” into the terminal and press enter. Again, feel free to replace [file name] with whatever you want,
  but I recommend replacing it with “vim_tutorial.txt”. In that case, the command would look like this: “vim vim_tutorial.txt”.
  This will open up a new window, which will be the file you are editing! To edit the file, first press the “i” key.
  This tells Vim to switch from normal (command) mode into insert mode, where you can type text. Now type “This is a blank vim file.” and press the escape key on your keyboard.
  The escape key tells Vim to take you out of insert mode and puts you back in normal mode. Next, type “:wq”. In Vim, the “w” stands for write (save) your work and
  the “q” stands for quit. If you ever desire to quit without saving, then type the command “:q!”.

• Another extremely useful and extremely common way to create a file is to copy it from an existing file!
  This can be done by using “cp”, the copy command. For instance, if we wanted to make a copy of “vim_tutorial.txt” named “vim_tutorial_copy.txt”, all
  we would have to do is type “cp vim_tutorial.txt vim_tutorial_copy.txt” and make sure that the file we want to copy is in the same directory we are in.
  You can create copies of a file from a different directory by specifying the path to the directory you wish to copy the file from. For instance, if
  I am inside the directory “shell_tutorial” and I want to copy a file named “cool_file.txt” that is one directory higher than “shell_tutorial”, all
  I would have to do is type: “cp ../cool_file.txt cool_file_copy.txt”

  o *Note: You should be careful when creating a copy of a file, as it copies the entire contents of the file; if you have very big files, you will be
    making very large copies.
• The final command I would like to talk about here is the “mv” or “move” command.
  This command lets you move a file from one location to another, or even rename a file! For instance, if I wanted to move “vim_tutorial.txt” to
  the directory above “shell_tutorial”, I could navigate inside of “shell_tutorial” using the “cd” command, then type the following command:
  “mv vim_tutorial.txt ..”. In this command the first parameter specifies the file in question and the second parameter specifies the destination.
  If I wanted to move “vim_tutorial.txt” back inside of “shell_tutorial”, all I would have to do is type: “mv ../vim_tutorial.txt .”.
  In this case, the first argument is the path to the file I wish to move and the second argument is the “.” (dot) specifying that I want the file moved to
  this directory.

  o The final note I want to make on the “mv” command is its ability to rename files. In order to do this, all you need to do is type:
    “mv [name of file to be changed] [new name]”. If I wanted to change the name of “vim_tutorial.txt” to “awesome_vim.txt” I would type:
    “mv vim_tutorial.txt awesome_vim.txt”.

Now that we have created a file a few different ways, we can admire our hard work! To do this let's use the “cat”, “head”, and “tail” commands!
• First, let's use “cat”.
This command stands for concatenate and will print the contents of a file to “standard out” (file descriptor 1), which is a fancy
  way to say it will print to the terminal. To use cat on the file we created, type the following command: “cat vim_tutorial.txt”.
  This will display the contents of the file “vim_tutorial.txt”.

  o *Note: You should be careful when using “cat” as it can quickly flood your terminal with output if you cat a file that is too large.

• Second, let's use “head” to display the content of our “vim_tutorial.txt” file. “Head” refers to the “head” of a file.
  It will print the first 10 lines of a file to standard out (stdout). In order to use it on the file we just created, type: “head vim_tutorial.txt”.

  o Head also has flags associated with it, the same way “ls” has the “-l” and “-a” flags. Head lets the user specify how many lines from the top of the
    file the user would like to see. For instance, if we type “head -5 vim_tutorial.txt”, only the first five lines of “vim_tutorial.txt” will be shown!
    This is especially useful if you have a file so large that opening it and looking at the first N lines would take a long time,
    as can be the case in World of Code. The best part is that you can also specify a number of lines greater than 10 for head to print.
    For instance, “head -1000 vim_tutorial.txt” would print the first one thousand lines of “vim_tutorial.txt” if it had that many lines!

• Finally, let's use “tail” to display the content of our “vim_tutorial.txt” file. “Tail” works like the opposite of head and refers to the tail of a file.
  It will try to print the final 10 lines of any given file. In order to print the end of “vim_tutorial.txt”, go ahead and enter “tail vim_tutorial.txt”.

  o Tail also has the ability to specify how many lines of output you desire from the end of your file. To do this, use the -x flag where x is a
    given number. For instance, typing “tail -5 vim_tutorial.txt” would print the final five lines of “vim_tutorial.txt”.

Now that we know how to create files and directories, let's learn how to traverse these directories!
The simplest way to traverse directories is by moving through them one directory at a time.
To do this we can make use of the “cd” command, commonly referred to as the “change directory” command.

• First, let's navigate into the directory we created earlier! In order to do this, we need to type “cd [name of existing directory]”.
  If you named the directory “shell_tutorial”, then the command would look like this: “cd shell_tutorial”.
  Once you are in this directory you can once again make use of “ls” to see what is in it. In this case I recommend using
  the “-a” flag (“ls -a”) in order to see the “.” and “..” folders. These folders hold valuable information for traversing directories.
  The singular dot “.” means “this directory” while the double dots “..” mean “parent directory”.

• Next, let's navigate back to the directory we started in. We can do this by typing “cd ..”.
  This command will move us back up one directory.

• Next, let's understand how the tilde (~) character works. In Linux, the tilde expands to your home directory, so it can fill in
  path information without you typing it out in full. This can be used in a variety of ways, including fast directory traversal. I will cover it more as it becomes relevant.

• *Note: You can use the “cd ..” command from your home directory to move up a level and see everyone on that da server.
  For instance, I am usually on da0. If you wanted to move to my directory you could simply ssh into da0 and type “cd ../wparham1/” to see my home
  directory and its contents. If you wanted to go back to your directory from this location, all you would need to do is type “cd ../[your username]/”

Now we can talk about deleting files and directories. To do this we will use two commands: rmdir and rm.
• Starting with the rmdir command first: it can be used to remove empty directories. To test this, let's go ahead and create a directory named
  “short_lived_dir”. To do this type “mkdir short_lived_dir”. To make sure that the directory is there you can type “ls”.
  Next, in order to delete the empty directory, we will type “rmdir short_lived_dir”. Upon checking with “ls” we can see that the directory
  “short_lived_dir” no longer exists!

• Next, let's look at using the rm command. This command can be used to delete both directories and files, but let's look at how to delete files first.
  o To delete a file, simply type “rm [filename]”. However, you need to be careful when you do this because there is no undo.
    Deleting anything in the shell is usually a permanent action. For demonstration purposes, let's go ahead and create a copy of “vim_tutorial.txt”
    named “short_lived_tutorial.txt”. To do this, go to the directory with “vim_tutorial.txt” and type: “cp vim_tutorial.txt short_lived_tutorial.txt”.
    Next, we can use “ls” to check if our copy exists. After confirming its existence, let's delete it.
    To do this, all you need to enter is “rm short_lived_tutorial.txt”. This will permanently delete the copy.

  o To delete a directory with contents, we will need to use the rm command with the -r flag. This specifies a recursive directory traversal to
    delete all content inside the directory. To illustrate this, let's make a copy of the “shell_tutorial” directory.
    We can do this by typing “cp -r shell_tutorial short_lived_tutorial”. The -r flag specifies a recursive copy, allowing us to copy the entire
    directory and its contents! Now, to delete this directory we need to type: “rm -r short_lived_tutorial”.
    This will immediately delete the specified directory and no longer grant access to any files it contained.

*Note: You should ALWAYS double check that you are deleting the correct directory when you specify the -r flag. If you specify the wrong directory you could delete your entire home directory, losing all work that isn't saved elsewhere on a different machine, GitHub repo, etc. For this reason, you should never, under any circumstance, use the “.” or “..” folders when specifying a directory to delete. It is much safer to navigate to the parent of the directory you want to delete and specify the directory by name.

The next command I would like to talk about is the “grep” command. The grep command allows a user to search through a file for a given string.
• For instance, if you wanted to search for the word “blank” inside of “vim_tutorial.txt” you could type
  “grep blank vim_tutorial.txt”. This will print all lines that contain the string “blank” to standard out.
  This command is extremely useful for searching and filtering when using World of Code, particularly in x2y type mappings.

• You can also combine grep with regular expressions to expand the range of what you can match.
  A great example of this is: grep -iE ';code\W?(of)?\W?conduct'. Here the -i flag tells grep to ignore case and
  the -E flag enables extended regular expressions.

The next useful general-purpose command is “wc”, which is short for “word count”.
Word count allows you to see how many lines, words, and bytes are in a file. This, much like head and tail, is particularly useful for the
potentially huge files you can encounter when using World of Code (8 GB of project names, anyone? I digress).
• Thankfully, “wc” is an extremely intuitive command to use. If we call “wc” without any flags it will return the number of lines, words, and bytes
  in a file, in that order. For instance, “wc vim_tutorial.txt” will result in the following output: “1 5 26 vim_tutorial.txt”.

• We can also specify whether we want the number of lines using “-l”, the number of words using “-w”, or the number of bytes using “-c”.
  For instance, “wc -l vim_tutorial.txt” will result in the following output: “1 vim_tutorial.txt”.

Also worth mentioning is the “clear” command. Many times, you will accidentally flood your terminal with output by catting a file that was too large or
running a command to stdout that should have been redirected to a file. When you do this it can be helpful to clear that output and get a clean terminal
screen so you can better keep track of where you are in a directory, the last command you executed, etc.
To do this, all you need to do is type “clear” into the terminal and press enter. This will give you a completely clean terminal to work in.

The next thing we will look at is simple output redirection. With this, anything that is printed to the terminal or stdout can be redirected to a file.
This can be extremely useful if you don't want to do formatted write-to-file calls in a programming language and instead would rather just write to standard
out and redirect it to a file!

• To understand this, let's start with an example. Since we know how to cat a file to print its contents to standard out, we can start there.
  First, navigate to the directory holding “vim_tutorial.txt” and then cat it by entering “cat vim_tutorial.txt”.
  Once you have done that, take it a step further and enter “cat vim_tutorial.txt > my_new_file.txt”.
  Upon executing this command, you may notice that there is no output to the terminal. This is because we have redirected
  the output to the file specified after the “>”. The file after the “>” will always be created or overwritten depending on the file's
  previous existence. This can be important because it means you can very easily overwrite a file you have already created if you redirect into
  a file with the same name. Watch out for this.

Now we will cover a few complicated but equally useful examples that pertain to World of Code specifically.
• The first thing we will cover is how to retrieve a list of all projects and deforked projects from version U of World of Code.
  o This can be done by entering the following command on the command line: “zcat /da?_data/basemaps/gz/p2PU.s”.
    This will produce a comprehensive list of all projects mapped to their deforked counterparts, separated by a “;”.
This makes it convenient to tokenize each line using your preferred programming language to look at each individual piece.
    This also makes it convenient to use the “cut” command in Linux to grab only the portion of the line you are interested in.
    As a quick note, this does not guarantee that each project is unique. If you need a list of unique projects, you should pipe either your cut version
    or the full version through “sort” and then the “uniq” Linux filter (uniq only collapses adjacent duplicate lines).

*Note: An example of this will be added at a later date

• The next thing we will cover is how to query and export a list from the MongoDB database contained in World of Code.

  o First, enter MongoDB. If you are on da1 you simply type “mongo”. If you are on a different server you can type “mongo --host da1.eecs.utk.edu”.
    Next, specify that we want to use World of Code by entering “use WoC”. Now we can export from either A_metadata or P_metadata.
    A_metadata stands for author metadata and P_metadata for project metadata. In this example we will be using project metadata, or “P_metadata”.
    To understand how to create this export, we can look at the following command (note that mongoexport itself is run from the shell, not from the mongo prompt):
    "mongoexport -h da5 -d WoC -c P_metadata.U -f ProjectID,NumActiveMon,NumAuthors,NumCommits,Gender -o dump_with_gender.csv --type=csv"

  o While this command is very long, it is thankfully quite easy to understand once it is broken down.
    First, mongoexport tells the MongoDB database that we want to export a list from it. The “-h” specifies that we want the host to be da5.
    The “-d” specifies the WoC database. The “-c” specifies the collection as P_metadata.U, and the “-f” specifies the fields we desire.
    After the -f you should enter the fields you would like to be exported, in that order, without spaces.
    The “-o” specifies a file to write to, in this case “dump_with_gender.csv”.
    Finally, the “--type=csv” specifies that we want this file in CSV, or comma-separated value, format.
    This is the same way Excel files are formatted, for those with Excel experience!

  o A few important things to note about how Mongo generates these files. If Mongo does not have all the information necessary to populate a field, it
    will fill the field in with ‘’. This is important if you want to tokenize on commas and look at each value.
    Also, when you specify gender, sometimes Mongo will give you the number of females, males, or both. It tries to give you all the information it
    has, but the amount of information can be inconsistent. It is good to be prepared for this issue.
    Lastly, when you query Mongo in this way it will give you a result for every project it has, meaning it will return a very large file.
    Be prepared to wait a few minutes while it is generating this file, and be prepared to interpret it.
    In many scenarios you cannot view it as just another Excel file because it will be too large for the Excel grid.
    I have also had it crash my Visual Studio when trying to view it, even after assigning an appropriate amount of memory.

• Finally, I would like to look at how we can fetch files off of World of Code without using a third party like a GitHub repository, especially if the
  file is too big for GitHub and GitHub's Large File Storage (LFS).

  o To do this we can use the “scp” command from the terminal on our local PC.
This means that we do not run the following command on the WoC servers;
    we run it on our local machine. This is an example of me using scp on my command line: "scp wparham1@da0:~wparham1/dump_with_gender.csv ~/Downloads/"

  o The “scp” command stands for “secure copy”. In order to use it you will need ssh permission to the location you wish to copy from, in this case World
    of Code. To use this command, first specify your username, then @, then the server you wish to connect to, followed by a colon and the path of the
    file you wish to copy. In my case this first parameter would be “wparham1@da0:~wparham1/dump_with_gender.csv”.
    The second parameter is the location on your local computer you wish to copy this file to. In my case this would be “~/Downloads/”. The scp command in a
    generalized form would be as follows: scp [WoC username]@[WoC server you ssh into]:[path to the file you want to download] [path to the location you want to download to].
--------------------------------------------------------------------------------
/wochardware.md:
--------------------------------------------------------------------------------
# WoC Hardware

|hostname|CPU|RAM|HDD|SSD|
|--|--|--|--|--|
|da[0-2]| Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz - 24 cores | 396GB | 35TB HDD||
|da3| Intel(R) Xeon(R) CPU E5-2623 v3 @ 3.00GHz - 8 cores | 396GB | 70TB HDD |15TB SSD|
|da4|Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz - 16 cores |792GB |90TB HDD |15TB SSD|
|da5|Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz - 80 cores|1.32TB|124TB HDD | 48TB SSD|
|da6|Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz - 40 cores + 4 Nvidia Tesla V100-SXM2-32GB|256GiB||1.9TB SSD|
|bb0[^1]|||305TB HDD||
|bb1[^1]|||471TB HDD||
|da7[^1]|Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz x2 = 32 HW threads|384GiB|655TB HDD ZFS, ~550TB usable in 45-wide raidz2||
|da8[^1]|Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz x2 = 32 HW threads|384GiB|655TB HDD ZFS, ~510TB usable in 3x15-wide raidz2||

[^1]: These systems are primarily for storage and NOT recommended for running jobs. Access may not be available to all users, and they do not use the same mount points.
--------------------------------------------------------------------------------