├── Assets
    ├── Database-workflow.png
    ├── Database.png
    ├── WoC_Logo.png
    └── reuse_DFD.png
├── Database.dia
├── LICENSE
├── README.md
├── ShellGuide.md
└── wochardware.md


/Assets/Database-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/Database-workflow.png


--------------------------------------------------------------------------------
/Assets/Database.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/Database.png


--------------------------------------------------------------------------------
/Assets/WoC_Logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/WoC_Logo.png


--------------------------------------------------------------------------------
/Assets/reuse_DFD.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Assets/reuse_DFD.png


--------------------------------------------------------------------------------
/Database.dia:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woc-hack/tutorial/e5223cfddb651d7b6a6f537052a45c5b5be9c9e0/Database.dia


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | World of Code Infrastructure includes 
 2 | =====================================
 3 | a) Copyright (c) (2018, 2019, 2020, 2021, 2022, 2023, 2024) Audris Mockus  
 4 | World of Code software used to discover retrieve, clean, cross-reference, 
 5 | query, and analyse open source version control data is licensed under Mulan PSL v2.
 6 | You can use this software according to the terms and conditions of the Mulan PSL v2.
 7 | You may obtain a copy of Mulan PSL v2 at:  http://license.coscl.org.cn/MulanPSL2
 8 | THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
 9 | EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
10 | MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
11 | See the Mulan PSL v2 for more details.
12 | 
13 | =====================================
14 | 
15 | b) The metadata including the collections and relationships among the source code 
16 | is licensed under Creative Commons Attribution 4.0 International license. Please see full details 
17 | at: https://creativecommons.org/licenses/by/4.0/
18 | 
19 | Furthermore, by accessing the data in WoC infrastructure, you agree with the Ethical Charter for using the 
20 | archive data (see, e.g., https://www.softwareheritage.org/legal/users-ethical-charter/).
21 | 
22 | =====================================
23 | 
24 | c) The actual source code in the collection retains the original license  
25 | of the specific source code/version. 
26 | 
27 | =====================================
28 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
   1 | # Tutorial: World of Code (WoC) Basics [DEPRICATED]
   2 | 
   3 | ![WoC_logo](/Assets/WoC_Logo.png)
   4 | 
   5 | # Please see new tutorial docs [here..](https://worldofcode.org/docs/#/)
   6 | 
   7 | ## Overview
   8 | 
   9 | World of Code (WoC) is a large-scale infrastructure designed to mine and analyze the entirety of open-source software ecosystems. It aggregates data from millions of repositories across various platforms (e.g., GitHub, GitLab) and provides cross-references between authors, projects, commits, and files. This enables researchers and developers to study software development trends, dependencies, and code sharing at a global level. WoC is essential for researchers looking to examine the evolution and structure of open-source ecosystems, supporting analyses in software supply chains, developer behavior, and code reuse.
  10 | 
  11 | ## Important Links
  12 | 
  13 | 1. [New Tutorial Documentation](https://worldofcode.org/docs/#/)
  14 | 
  15 | 2. [WoC Registration Form](https://docs.google.com/forms/d/e/1FAIpQLSd4vA5Exr-pgySRHX_NWqLz9VTV2DB6XMlR-gue_CQm51qLOQ/viewform?vc=0&c=0&w=1&flr=0&usp=mail_form_link): To request access to our servers
  16 | 
  17 | 3. [WoC Structure and Its Elements Video](https://youtu.be/c0uFPwT5SZI)
  18 | 
  19 | 4. [Tutorial Recording from 2022-10-27](https://drive.google.com/file/d/1ytzOiOSgMpqOUm2XQJhhOUAxu0AAF_OH/view?usp=sharing) and [Older Tutorial Recoding (possibly obsolete) from 2019-10-15](https://drive.google.com/file/d/14tAx2GQamR4GIxOc3EzUXl7eyPKRx2oU/view?usp=sharing) 
  20 | 
  21 | 5. [WoC Website](https://worldofcode.org)
  22 | 
  23 | 6. [WoC Discord](https://discord.gg/fKPFxzWqZX): Get updates or ask questions related to WoC
  24 | 
  25 | ## Additional Resources
  26 | 
  27 | 1. [WoC Shell Guide](https://github.com/woc-hack/tutorial/blob/master/ShellGuide.md): A brief guide on how to use bash and other related tools
  28 | 
  29 | 2. [Unix Tools: Data, Software and Production Engineering](https://courses.edx.org/courses/course-v1:DelftX+UnixTx+1T2020/course/): Consider auditing this Massive Open Online Course (MOOC) if you are not comfortable working in the terminal or working with shell scripting
  30 | 
  31 | ## Before You Start..
  32 | ### Step 1. Requirements to Access the da Server(s)
  33 | 
  34 | To register for the hackathon/tutorial, please
  35 | generate a ssh public key. See instructions below.
  36 | 
  37 | For macOS and Unix users, the instructions below would work. Still, for Windows users, the best option is to enable [Ubuntu Shell](https://winaero.com/blog/how-to-enable-ubuntu-bash-in-windows-10) or [install Linux on Windows with WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10) first, then follow instructions for Unix/macOS. 
  38 | Alternatively, you may use the [OpenSSH Module for PowerShell](https://www.techrepublic.com/blog/10-things/how-to-generate-ssh-keys-in-openssh-for-windows-10/) or [Git-Bash](https://docs.joyent.com/public-cloud/getting-started/ssh-keys/generating-an-ssh-key-manually/manually-generating-your-ssh-key-in-windows#Git-Bash) as an alternative option.
  39 | 
  40 | To generate a ssh key, open a terminal window and run the `ssh-keygen` command. Once completed, it produces the `id_rsa.pub` and `id_rsa` files inside your $HOME/.ssh/ folder. 
  41 | To view the newly generated *public key*, type:
  42 | 
  43 | ```
  44 | cat ~/.ssh/id_rsa.pub
  45 | ```
  46 | 
  47 | You will need to provide this ssh *public key* when you complete the **WoC Registration Form** (step 3), as the form will ask you for the contents of `id_rsa.pub` and your **GitHub** and **Bitbucket**  handles (step 2). You will receive a response to the email you provide in the form once your account is set up (more details below).
  48 | 
  49 | Set up your `~/.ssh/config` file so that you can log in to one of the da servers without having to fully specify the server name each time:
  50 | 
  51 | ```
  52 | Host *
  53 |   ForwardAgent yes
  54 | 
  55 | Host da0
  56 | 	Hostname da0.eecs.utk.edu
  57 | 	Port 443
  58 | 	User YourUsername
  59 | 	IdentityFile ~/.ssh/name_of_priv_key
  60 | ```
  61 | Please note that access to the remaining servers is similarly available. da2 and da3 have ssh port 22 (both are running the worldofcode.org web server on the https port 443)
  62 | 
  63 | *Your_UserName* is the login name you provided on the signup form. With the config setup, logging in becomes as simple as typing `ssh da0` in your terminal.
  64 | 
  65 | ### Step 2. GitHub and Bitbucket Accounts Setup
  66 | 
  67 | If you dont have these already, please set up an account on both
  68 | GitHub and Bitbucket (these will be needed to invite you to the
  69 | relevant repositories on GitHub & Bitbucket).
  70 | * [GitHub Sign-up](https://github.com/pricing)  
  71 | * [Bitbucket Sign-up](https://bitbucket.org/account/signup/)  
  72 | 
  73 | ### Step 3. Request for Access
  74 | 
  75 | Users may access our systems/servers by obtaining a WoC account. You may do so by registering for an account through the [WoC Registration Form](https://docs.google.com/forms/d/e/1FAIpQLSd4vA5Exr-pgySRHX_NWqLz9VTV2DB6XMlR-gue_CQm51qLOQ/viewform?vc=0&c=0&w=1&flr=0&usp=mail_form_link). We strive to provide access to new users the same day you fill out the form, but in the worst-case scenario, please allow up to 1 day for the account creation.
  76 | 
  77 | ## Tutorial Objectives
  78 | 
  79 | Prepare for the hackathon or perform research, make sure connections work, get familiar with the basic functionality and potential of WoC, and start thinking about how to investigate global relationships in open source.
  80 | 
  81 | ### WoC Objectives
  82 | 
  83 | Do the hard work to enable research on global properties of (Free, Libre, and Open Source Software) FLOSS: 
  84 | 
  85 | * Census of all FLOSS
  86 |    - What is out there, of what kind, how much
  87 |    - Ability to select projects/developers/APIs for natural experiments/other empirical studies 
  88 | * Provide FLOSS-wide relationships
  89 |    - Technical dependencies (to run applications)
  90 |    - Tool dependencies (to build/test applications)
  91 |    - Code copying
  92 |    - Knowledge (and people) migration
  93 |    - API use and spread over time
  94 | * Data Cleaned/Augmented/Contextualized
  95 |    - Correction: Authors/Forks/Outliers
  96 |    - Augmentation: Dependencies/Linking to other data sources
  97 |    - Context: project types/expertise
  98 | * Big Data Analytics: Map entities to all related entities efficiently
  99 | * Timely: Targeting < 1 Quarter old analyzable snapshot of the entire FLOSS
 100 | * Community run
 101 |    - Hackathons help determine the community needs
 102 |    - [Hackathon Schedule](https://github.com/woc-hack/schedule)
 103 | * How to participate?
 104 |    - [Hackathon Registration Form](http://bit.ly/WoCSignup)
 105 |    - If you can not attend the hackathon but just want to try out WoC, please fill the hackathon form but indicate in the topic section is that you do not plan to attend the hackathon.
 106 | 
 107 | ### What WoC Contains
 108 | 
 109 | ![Workflow](https://github.com/woc-hack/tutorial/blob/master/Assets/Database-workflow.png)
 110 | ![Content: Commits., trees, blobs, projects, authors](https://github.com/woc-hack/tutorial/blob/master/Assets/Database.png)
 111 | 
 112 | ### Related background reading
 113 | 
 114 | - [About WoC](https://bitbucket.org/swsc/overview/raw/master/pubs/WoC.pdf)
 115 | - [Overview of the Software Supply Chains](https://bitbucket.org/swsc/overview/src/master/README.md)
 116 | - [Details on WoC storage/APIs](https://bitbucket.org/swsc/lookup/src/master/README.md)
 117 | - [Fun Facts](https://bitbucket.org/swsc/overview/src/master/fun/README.md)
 118 | 
 119 | ## Activity 1: Access to da Server(s)
 120 | 
 121 | Log in: `ssh da0`.
 122 | 
 123 | Once you are in a da server, you will have an empty directory under `/home/username` where you can store your programs and files:  
 124 | ```
 125 | -bash-4.2$ pwd
 126 | /home/username
 127 | -bash-4.2$ 
 128 | ```
 129 | 
 130 | Set up your shell:
 131 | ```
 132 | -bash-4.2$ echo 'exec bash' >> .profile
 133 | -bash-4.2$ echo 'export PS1="\u@\h:\w>"' >> ~/.bashrc
 134 | -bash-4.2$ . .bashrc
 135 | [username@da0]~%
 136 | ```
 137 | 
 138 | You can also login to other da servers, but first need to set up an ssh key on these systems:
 139 | ```
 140 | [username@da0]~% ssh-keygen
 141 | Generating public/private rsa key pair.
 142 | Enter file in which to save the key (/home/niravajmeri/.ssh/id_rsa): 
 143 | Enter passphrase (empty for no passphrase): 
 144 | Enter same passphrase again: 
 145 | Your identification has been saved in /home/niravajmeri/.ssh/id_rsa.
 146 | Your public key has been saved in /home/niravajmeri/.ssh/id_rsa.pub.
 147 | The key fingerprint is:
 148 | SHA256:/UoJkpnx5mn8jx4BhcnQRUFfPq4qmC1MVLRJSjpYnpo niravajmeri@da0.eecs.utk.edu
 149 | The key's randomart image is:
 150 | +---[RSA 2048]----+
 151 | |    . o=o**.  .  |
 152 | |   + + o*+ . o   |
 153 | |  . = o.+   . o  |
 154 | |   o ..* o   . . |
 155 | |  E  .= S o   .  |
 156 | |      .= o + .   |
 157 | |     o += + o    |
 158 | |      =.oo =     |
 159 | |       . o*..    |
 160 | +----[SHA256]-----+
 161 | ```
 162 | Once the key is generated add it to your .ssh/auhorized_keys
 163 | ```
 164 | [username@da0]~% cat .ssh/id_rsa.pub >> .ssh/authorized_keys 
 165 | ```
 166 | 
 167 | Now you can login to da4:
 168 | ```
 169 | [username@da0]~% ssh da4
 170 | [username@da4]~% 
 171 | ```
 172 | 
 173 | ### Exercise 1
 174 | 
 175 | Log in to da0 and clone two repositories that contain APIs to access WoC data
 176 | ```
 177 | [username@da0]~% git clone https://bitbucket.org/swsc/lookup
 178 | [username@da0]~% git clone https://github.com/ssc-oscar/oscar.py
 179 | ```
 180 | 
 181 | Log in to da4 from da0:
 182 | ```
 183 | [username@da0]~% ssh da4
 184 | [username@da4]~% ls
 185 | ...
 186 | [username@da4]~% exit
 187 | [username@da0]~%
 188 | ```
 189 | 
 190 | ### Important Note
 191 | 
 192 | Make sure to access these directories and execute a `git pull` frequently to ensure you are working with latest updates.
 193 | 
 194 | ## Activity 2: Shell APIs - Basic Operations
 195 | 
 196 | Shell APIs might be useful to accsess content of the commit, trees, blobs, 
 197 | to calculate the diff produced by a commit, etc. 
 198 | 
 199 | For more examples [see full API](https://bitbucket.org/swsc/lookup/src/master/README.md). 
 200 | 
 201 | Lets look at the commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a:
 202 | 
 203 | ```
 204 | [username@da0]~% echo 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a | ~/lookup/showCnt commit 3
 205 | tree 464ac950171f673d1e45e2134ac9a52eca422132
 206 | parent dddff9a89ddd7098a1625cafd3c9d1aa87474cc7
 207 | author Warner Losh <imp@FreeBSD.org> 1092638038 +0000
 208 | committer Warner Losh <imp@FreeBSD.org> 1092638038 +0000
 209 | 
 210 | Don't need to declare cbb module.  don't know why I never saw
 211 | duplicate messages..
 212 | ```
 213 | 
 214 | This commit has a tree and a parent commit and is created by 'Warner Losh <imp@FreeBSD.org>'. 
 215 | (parameter 3 defines that raw output needs to be produced)
 216 | 
 217 | Lets inspect the tree (the root folder of the project) for the first and last file:
 218 | ```
 219 | [username@da0]~% echo 464ac950171f673d1e45e2134ac9a52eca422132 | ~/lookup/showCnt tree | awk 'NR==1; END{print}'
 220 | 100644;a8fe822f075fa3d159a203adfa40c3f59d6dd999;COPYRIGHT
 221 | 040000;6618176f9f37fa3e62f2efd953c07096f8ecf6db;usr.sbin
 222 | ```
 223 | 
 224 | We may also want inspect the first element in the tree (blob representing file COPYRIGHT). We limit the output to the first two lines only:
 225 | ```
 226 | [username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/showCnt blob | head -n 2
 227 | # $FreeBSD$
 228 | #  @(#)COPYRIGHT  8.2 (Berkeley) 3/21/94
 229 | ```
 230 | 
 231 | ### Important Note
 232 | 
 233 | When wanting to get the content of many objects (or look up
 234 | values for many keys), please use a single function invocation and
 235 | provide multiple keys/sha1s as standard input since each call to showCnt
 236 | and getValues may involve ssh to another da server (where the data
 237 | resides).  
 238 | To separate content of separate blobs, you can ask showCnt to put
 239 | output on a single line, for example:
 240 | ```
 241 | [username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/showCnt blob 1
 242 | ```
 243 | This command will produce a single line output, starting from sha1:
 244 | ```
 245 | a8fe822f075fa3d159a203adfa40c3f59d6dd999;IyAkRnJlZUJTRCQKIwlAKCMpQ09Q....
 246 | ```
 247 | The content of the blob is base64 encoded (use python's base64.b64decode).
 248 | 
 249 | ### Exercise 2
 250 | 
 251 | Determine the author of the parent commit for commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a
 252 | 
 253 | Hint 1: parent commit is listed in the content of commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a above
 254 | ```
 255 | [username@da0]~% echo dddff9a89ddd7098a1625cafd3c9d1aa87474cc7 | ~/lookup/showCnt commit
 256 | ```
 257 | 
 258 | ### Summary for Activity 2
 259 | 
 260 | Synopsis:
 261 | ```
 262 | ~/lookup/showCnt commit|tree|blob
 263 | ```
 264 | reads from the standard input sha1 of the corresponding objects and 
 265 | prints the content of these objects. 
 266 | 
 267 | ## Activity 3: Investigate the Maps
 268 | 
 269 | We see the content of the copyright file above. Such files are often copied verbatim. Lets determine the first author who has created it (irrespective of a repository).
 270 | WoC has created this relationship and stored in b2fa (Blob to First Author) map:
 271 | 
 272 | ```
 273 | [username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 |  ~/lookup/getValues b2fa
 274 | a8fe822f075fa3d159a203adfa40c3f59d6dd999;1072910122;Warner Losh <imp@FreeBSD.org>;121f970412fec7f9af0352a9b4ce8dca43bdb59e
 275 | ```
 276 | 
 277 | It turns out that it was created by commit 121f970412fec7f9af0352a9b4ce8dca43bdb59e done by what appears to be the same author on unix second 1072910122.
 278 | 
 279 | What is b2fa? The letters signify what keys (b - Blob) and values
 280 | (fa - first author) mean. As in natural sentence some decontextualization is needed in rare cases as this because f generally stands for file. Literally, that would mean b2fa is blob to file and author. As the number of objects and maps will multiply, single letters will not do and full word parsing will be used. 
 281 | 
 282 | **Primary Objects:**
 283 | 
 284 | * a = Author (A - aliased author)
 285 | * b = Blob  (b2c map will become obsolete as of version U as one can get more info from b2tac)
 286 | * c = Commit, cc - child commit and pc - parent commit
 287 | * f = File (occasionally its an adjective modifying the following object as in fa or First Author)
 288 | * p = Project (P - deforked project)
 289 | * t = Time (unix unsigned long in UTC)
 290 | * g = gender
 291 | 
 292 | **Capital Version** - simply means that the data has been corrected:   
 293 | * A = aliased version (see https://arxiv.org/abs/2003.08349); any organizational and group IDs, bot IDs, as well as author IDs, that do not
 294 | contain sufficient info to alias are excluded.
 295 | * P = deforked project (via Leuwen community
 296 | detection on commit / repo bi-graph: https://arxiv.org/abs/2002.02707)
 297 | 
 298 | We can inspect the relationships between a and A and also between p
 299 | and P:
 300 | ```
 301 | [username@da0]~% echo 'Warner Losh <imp@FreeBSD.org>' | ~/lookup/getValues a2A
 302 | Warner Losh <imp@FreeBSD.org>;imp <imp@bsdimp.com>
 303 | [username@da0]~% 
 304 | [username@da0]~% echo 'imp <imp@bsdimp.com>' | ~/lookup/getValues A2a | tr ";" "\n" | head -n 3
 305 | imp <imp@bsdimp.com>
 306 | M. Warner Losh <imp@bsdimp.com>
 307 | M. Warner Losh <imp@freebsd.org>
 308 | ```
 309 | 
 310 | Going back to blob we may ask if this blob has been widely copied as would be expected for copyright files. We can use b2tac to obtain a sha1's blob to time, author, and commit. The following example pipes the output to only see the first entry:
 311 | 
 312 | ```
 313 | [username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 |  ~/lookup/getValues b2tac | cut -d ";" -f1-4          
 314 | a8fe822f075fa3d159a203adfa40c3f59d6dd999;1072910122;Warner Losh <imp@FreeBSD.org>;121f970412fec7f9af0352a9b4ce8dca43bdb59e
 315 | ```
 316 | 
 317 | b2tac (blob to time, author, commit) shows the numerous commits that introduced that blob in all repositories. We can further use commit to project map (c2p)
 318 | to identify all associated projects:
 319 | 
 320 | ```
 321 | [username@da0]~% echo a8fe822f075fa3d159a203adfa40c3f59d6dd999 | ~/lookup/getValues b2tac | awk -F \; '{for(i=4;i<NF;i+=3){print $i}}' | ~/lookup/getValues -f c2p | cut -d ";" -f2 | sort -u |  head -3
 322 | 0cjs_unix-history-repo
 323 | 0mp_freebsd
 324 | 0xbda2d2f8_freebsd
 325 | ```
 326 | 
 327 | In fact, there are 1719 distinct repositories where this blob appears. If you would like to see them all, remove the last `head` portion in the previous command.
 328 | 
 329 | Finally, we have the author 'Warner Losh <imp@FreeBSD.org>' for the commit we have investigated. 
 330 | Can we find what other commits Warner has made? (The following output is limited to three commits only):
 331 | ```
 332 | [username@da0]~% echo 'Warner Losh <imp@FreeBSD.org>' | ~/lookup/getValues a2c | tr ";" "\n" |  head -n 4
 333 | Warner Losh <imp@FreeBSD.org>
 334 | 0000ce4417bd8d9a2d66a7a61393558d503f2805
 335 | 000109ae96e7132d90440c8fa12cb7df95a806c6
 336 | 0001246ed9e02765dfc9044a1804c3c614d25dde
 337 | ```
 338 | 
 339 | In addition to variable-length records (key;val1,;val2;...;valn), 
 340 | the output can be produced as a flat table (key;val1\nkey;val2\n...\nkey;valn)
 341 | using -f option:
 342 | ```
 343 | [username@da0]~% echo 'Warner Losh <imp@FreeBSD.org>' | ~/lookup/getValues -f a2c | head -n 5
 344 | Warner Losh <imp@FreeBSD.org>;0000ce4417bd8d9a2d66a7a61393558d503f2805
 345 | Warner Losh <imp@FreeBSD.org>;000109ae96e7132d90440c8fa12cb7df95a806c6
 346 | Warner Losh <imp@FreeBSD.org>;0001246ed9e02765dfc9044a1804c3c614d25dde
 347 | Warner Losh <imp@FreeBSD.org>;00014b72bf10ad43ca437daf388d33c4fea73df9
 348 | Warner Losh <imp@FreeBSD.org>;000153916157b29a14b65fa3efeff4e3788e1b0e
 349 | ```
 350 | 
 351 | In addition to the random lookup, the maps are also stored in flat
 352 | sorted files and this format is preffered (faster) when
 353 | investigating over two hundred thousand items or an entire WoC. 
 354 | For example, find commits by any author named Warner (a similar
 355 | task would be to find all blobs or commits involving a c-language
 356 | file ".c" or a README file "README"): 
 357 | ```
 358 | [username@da0]~% zcat /da7_data/basemaps/gz/a2cFull.V3.0.s | grep 'Warner Losh'
 359 | ```
 360 | As described below, the maps are split into 32 (or 128) parts to enable parallel search.
 361 | Full.V3.0 means that we are looking at a complete extract at version V3.0. 
 362 | 
 363 | As versions keep being updated, and data no longer fits on a single server, 
 364 | a more flexible way to run the same command would be
 365 | ```
 366 | [username@da0]~% zcat /da?_data/basemaps/gz/a2cFull.V3.?.s | grep 'Warner Losh'
 367 | ```
 368 | In other words, we look for the file on any of the servers and selecting an arbitrary 
 369 | version of the datbase.
 370 | 
 371 | ### Exercise 3
 372 | 
 373 | a) Find all files modified by author id 'Warner Losh <imp@FreeBSD.org>'
 374 | 
 375 | Hint 1: What is the map name?
 376 | 
 377 | Author ID to File or a2f
 378 | ```
 379 | [username@da0]~% echo 'Warner Losh <imp@FreeBSD.org>' | ~/lookup/getValues a2f 
 380 | ```
 381 | 
 382 | Find all commits  developers who have your last and your first name:
 383 | 
 384 | Hint 1: use wc (word count), e.g. (example takes a long time to compute), 
 385 | 
 386 | ```
 387 | [username@da0]~% zcat /da0_data/basemaps/gz/a2cFull*.s | grep -i 'audris' | grep -i 'mockus' | wc -l
 388 | ```
 389 | 
 390 | b) Find all files modified by all author IDs used by a developer 'Warner Losh <imp@FreeBSD.org>'
 391 | 
 392 | Hint 1: What is the map name?
 393 | A represents all author IDs so we first get the group name:
 394 | ```
 395 | echo 'Warner Losh <imp@FreeBSD.org>' | ~/lookup/getValues a2A
 396 | Warner Losh <imp@FreeBSD.org>;imp <imp@bsdimp.com>
 397 | ```
 398 | 
 399 | and then use it to get all files via A2f
 400 | ```
 401 | [username@da0]~% echo 'imp <imp@bsdimp.com>' | ~/lookup/getValues A2f 
 402 | ```
 403 | 
 404 | ### Summary for Activity 3
 405 | 
 406 | For any key provided on standard input, a list of values is provided
 407 | ```
 408 | ~/lookup/getValues [-f] a2c|c2dat|b2ta|b2fa|c2b|b2f|c2f|p2c|c2p|c2P|P2c
 409 | ```
 410 | option -f replaces one output line per input line into the number of lines corresponding to the number of values. 
 411 | (or single-value maps such as c2dat, b2fa) -f makes no sense as it prints distinct fields on separate lines)
 412 | 
 413 | Also, only the first column of the input is considered as the key, other fields are passed through, e.g., 
 414 | ```
 415 | [username@da0]~% echo 'Warner Losh <imp@FreeBSD.org>;zz' | ~/lookup/getValues -f a2c| head -n 3
 416 | Warner Losh <imp@FreeBSD.org>;zz;0000ce4417bd8d9a2d66a7a61393558d503f2805
 417 | Warner Losh <imp@FreeBSD.org>;zz;000109ae96e7132d90440c8fa12cb7df95a806c6
 418 | Warner Losh <imp@FreeBSD.org>;zz;0001246ed9e02765dfc9044a1804c3c614d25dde
 419 | ```
 420 | 
 421 | ## Actvity 4: Exploring the State of a Repo at the Last Commit
 422 | 
 423 | Lets suppose we only care for the last version of the files in a project, e.g last version of readme. 
 424 | lb2f (last blob to file) provides this relationship
 425 | 
 426 | ```
 427 | [username@da0]~% zcat /da?_data/basemaps/gz/lb2fFullV0.s | grep -i readme | head -n 5
 428 | 00000057bfb6f79bdfd129f113533f9ada77cbba;/README.md
 429 | 000000ad43fb50661d0f8ba20035f8f8a62b28b1;/README.md
 430 | 000000c4ca807de513cd601810522141ed8347bf;/Day-92 Collection/readMe
 431 | 000001222e62cd97679e0ed087c74037bab8f848;/README.md
 432 | 0000013e32eb5f7497750cf652cfd540f23abb3e;/README.md
 433 | ```
 434 | 
 435 | To get projects we just need to join it with b2P
 436 | 
 437 | ```
 438 | [username@da0]~% zcat /da?_data/basemaps/gz/lb2fFullV0.s | grep -i readme | join -t\; -1 1 -2 1 - <( zcat /da?_data/basemaps/gz/b2PFullV0.s) | head -n 5
 439 | 00000057bfb6f79bdfd129f113533f9ada77cbba;/README.md;yoooov_certinel
 440 | 000000ad43fb50661d0f8ba20035f8f8a62b28b1;/README.md;LeeYoonSam_SampleNodeEjsBoard
 441 | 000000c4ca807de513cd601810522141ed8347bf;/Day-92 Collection/readMe;22MCA10027_-100daysofcodeChallenge
 442 | 000001222e62cd97679e0ed087c74037bab8f848;/README.md;magnusrygh_cluster
 443 | 0000013e32eb5f7497750cf652cfd540f23abb3e;/README.md;sneakyGrit_hello-world
 444 | ```
 445 | 
 446 | Each project also has the last commit in lc2Pdat
 447 | 
 448 | ```
 449 | [username@da0]~% zcat /da?_data/basemaps/gz/lc2PdatFullV1.s | head -n 1 | tr ";" "\n"
 450 | 01000009d4c8d8f088e30519131e4e60cf61e969
 451 | Dushyant099_Tetris
 452 | 1500262480
 453 | -0400
 454 | Dushyant Patel <dushyant@Dushyants-MacBook-Air.local>
 455 | 214c30ce8162a624f1f2442ff7bed46d0fb7b4b1
 456 | 9e46a5cd45ce0adf3afe24ce616f5be0315c72b2
 457 | ```
 458 | 
 459 | In fact, lb2f is computed from lc2dat by taking the tree (column 6 of  lc2Pdat) and obtaining all blobds in that tree in, recusively, subtrees.
 460 | 
 461 | ## Activity 5: Using Python APIs from oscar.py
 462 | 
 463 | **oscar.py Tutorial:** oscar.py has their own tutorial for hackathon purposes. We suggest that you go [here](https://github.com/ssc-oscar/oscar.py/blob/master/docs/tutorial.md) and read through it. The tutorial contains information about the current available functions, how to implement applications (simple and complex), and useful imports for applications.
 464 | 
 465 | **Important Note:** If you experience any difficulties in retrieving data from oscar.py's function calls (i.e., you receive an empty tuple on function return), please run `git pull` in your cloned repo to stay up-to-date with the latest version of oscar.py.   
 466 | 
 467 | These are corresponding functions in oscar.py that open the .tch files listed below for a given entity. "/<function_name>" after a function name denotes the version of that function that returns a Generator object.
 468 | 
 469 | 1. `Author('...')`  - initialized with a combination of name and email
 470 | 	* `.blobs`
 471 | 	* `.commit_shas/commits`
 472 | 	* `.project_names`
 473 | 	* `.files`
 474 | 	* `.torvald` - returns the torvald path of an Author, i.e, who did this Author work
 475 | 				 with that also worked with Linus Torvald
 476 | 2. `Blob('...')` -  initialized with SHA of blob
 477 | 	* `.author` - returns timestamp, author name, and binary SHA of commit
 478 | 	* `.commit_shas/commits` - commits removing this blob are not included
 479 | 	* `.data` - content of the blob
 480 | 	* `.file_sha(filename)` - compute blob sha from a file content
 481 | 	* `.position` - get offset and length of blob data in storage
 482 | 	* `.parent`
 483 | 	* `.string_sha(string)`
 484 | 	* `.tkns` - result of ctags run on this blob, if there were any
 485 | 3. `Commit('...')` - initialized with SHA of commit
 486 | 	* `.blob_shas/blobs`
 487 | 	* `.child_shas/children`
 488 | 	* `.changed_file_names/files_changed`
 489 | 	* `.parent_shas/parents`
 490 | 	* `.project_names/projects`
 491 | 	* `.attributes` - time, tz, author, tree, parent(s)
 492 | 	* `.tdiff`
 493 | 5. Deprecated, see [#50](https://github.com/ssc-oscar/oscar.py/issues/50): `File('...')` - initialized with a path, starting from a commit root tree
 494 | 	* `.authors`
 495 | 	* `.blobs`
 496 | 	* `.commit_shas/commits`
 497 | 6. `Project('...')` - initialized with project name/URI
 498 | 	* `.author_names`
 499 | 	* `.commit_shas/commits`	
 500 | 7. `Tdiff('...')` - initialized with SHA, result of diff run on 2 blobs (if there was a diff)
 501 | 	* `.commit`
 502 | 	* `.file`
 503 | 8. `Tree('...')` - representation of a git tree object (dir), initialized with SHA of tree
 504 | 	* `.files`
 505 | 	* `.blob_shas/blobs`
 506 | 	* `.commit_shas/commits`
 507 | 	* `.traverse`
 508 | 
 509 | The non-Generator version of these functions will return a tuple of items which can then be iterated:
 510 | ```
 511 | for commit in Author(author_name).commit_shas:
 512 | 	print(Commit(commit))
 513 | ```
 514 | 
 515 | ### Exercise 5a: Get a list of commits made by a specific author
 516 | 
 517 | Install the latest oscar.py
 518 | 
 519 | ```
 520 | [username@da0]~% cd ~/oscar.py
 521 | ```
 522 | 
 523 | If "import oscar" fails
 524 | 
 525 | ```
 526 | [username@da0]~% easy_install --user clickhouse-driver
 527 | ```
 528 | 
 529 | As we learned before, we can do that in shell
 530 | 
 531 | ```
 532 | [username@da0]~% zcat /da?_data/basemaps/gz/a2cFull.V3.0.s | grep '"Albert Krawczyk" <pro-logic@optusnet.com.au>' | head -n 3
 533 | "Albert Krawczyk" <pro-logic@optusnet.com.au>;17abdbdc90195016442a6a8dd8e38dea825292ae
 534 | "Albert Krawczyk" <pro-logic@optusnet.com.au>;2a98c68d153f1fd78cc356727263a2046abf887d
 535 | "Albert Krawczyk" <pro-logic@optusnet.com.au>;3cdd0e1cefbec43a9c3d3138dd6734191529763a
 536 | ```
 537 | 
 538 | Now the same thing can be done using oscar.py:  
 539 | 
 540 | ```
 541 | [username@da0]~% cd oscar.py
 542 | [username@da0:oscar.py]~% python3
 543 | >>> from oscar import Author, Commit
 544 | >>> for i, commit in enumerate(Author('"Albert Krawczyk" <pro-logic@optusnet.com.au>').commit_shas):
 545 | ...     if i >= 3:
 546 | ...         break
 547 | ...     print(Commit(commit))
 548 | ...
 549 | 17abdbdc90195016442a6a8dd8e38dea825292ae
 550 | 2a98c68d153f1fd78cc356727263a2046abf887d
 551 | 3cdd0e1cefbec43a9c3d3138dd6734191529763a
 552 | >>>
 553 | ```
 554 | 
 555 | ### Exercise 5b: Get the URL of a projects repository using the oscar.py `Project(...).url` attribute:  
 556 | ```
 557 | [username@da0:oscar.py]~%  python3
 558 | >>> from oscar import Project
 559 | >>> Project('notcake_gcad').url
 560 | 'https://github.com/notcake/gcad'
 561 | ```
 562 | 
 563 | ### Exercise 5c
 564 | 
 565 | Get list of files modified by commit 17abdbdc90195016442a6a8dd8e38dea825292ae
 566 | 
 567 | Hint 1: What class to use?
 568 | Commit
 569 | ```
 570 | [username@da0:oscar.py]~%  python3
 571 | >>> from oscar import Commit
 572 | >>> Commit('17abdbdc90195016442a6a8dd8e38dea825292ae').changed_file_names
 573 | ```
 574 | 
 575 | ## Activity 6: Understanding Servers and folders
 576 | 
 577 | All home folders are on da2, so it is preferred not to do very large
 578 | file operations to/from these folders  when running tasks on servers
 579 | other than da2,
 580 | since these operations will load NFS and may slow access to home
 581 | folders of other users.
 582 | 
 583 | Each server has /data/play folder where you can create your
 584 | subfolders to store/process large files.
 585 | 
 586 | ### List of relevant directories
 587 | 
 588 | Not all files are stored on all servers due to limited disk sizes
 589 | and different speed of disks (fast refers to SSDs).
 590 | The location of the file can be identified via a pathname as described below. 
 591 | 
 592 | ### da0/../da5 Servers
 593 | #### <relationship>.{0-31}.tch files can be found in `/da[0-5]_fast/` or `/da[0-5]_data/basemaps`  
 594 | (.s) signifies that there are either .s or .gz versions of these files in /da[0-5]_data/basemaps/gz/ folder, which can be opened with Python gzip module or Unix zcat.  
 595 | all five da[0-5] server may have these .s/.gz files  
 596 | Keys for identifying letters:   
 597 | 
 598 | * a = Author
 599 | * b = Blob
 600 | * c = Commit
 601 | * cc = Child Commit
 602 | * f = File
 603 | * h = Head Commit
 604 | * ob = Parent Blob
 605 | * p = Project
 606 | * pc = Parent Commit
 607 | * P = Forked/Root Project (see Note below)
 608 | * ta = Time;Author
 609 | * fa = First;Author;commit
 610 | * r = root commit obtained by traversing commit history
 611 | * h = head commit obtained by traversing commit history
 612 | * td = Tdiff
 613 | * tk = Tokens (ctags)
 614 | * trp = Torvalds Path
 615 | 
 616 | Version T keys for identifying letters:
 617 | * L = LICENSE* files
 618 | * Lb - blobs that are shared among fewer than 100 Projects
 619 | * fb = firstblob
 620 | * tac = time, author, commit
 621 | * t = root tree
 622 | 
 623 | Recall that the captal version of author A means aliased version (see
 624 | https://arxiv.org/abs/2003.08349) and it also means that
 625 | organizational and group IDs, bot IDs as well as author IDs that do not
 626 | contain sufficient info to alias are excluded.
 627 | Similarly, the capital version of project P represents a deforked project (via Leuwen community
 628 | detection on commit / repo bi-graph: https://arxiv.org/abs/2002.02707)
 629 | 
 630 | List of relationships can be obtained via
 631 | ```
 632 | echo $(ls /da?_data/basemaps/gz/*FullV0.s| sed 's|.*/||;s|FullV0.s||')
 633 | A2P A2c A2mnc P2A P2a P2c P2core P2g P2mnc P2tac a2P a2c a2p c2P c2acp c2cc c2dat c2p c2pc p2a
 634 | p2c A2b A2f A2fb A2tPc A2tPlPkg A2tspan P2b P2binf P2f P2fb P2nfb P2tAlPkg P2tspan Pkg2tPA Pt2Ptb
 635 | Ptb2Pt a2f a2fb b2P b2def b2fA b2f b2fa b2ob b2ptf b2tA b2tP b2ta b2tk b bb2cf c2PtAbflDef
 636 | c2PtAbflPk g c2PtabflDef c2PtabflPkg c2b c2f c2fbb lb2f lc2Pdat ob2b obb2cf t2all t2ptf tk2b
 637 | ```
 638 | 
 639 | 
 640 | ```
 641 | * a2b 		* a2c (.s)	* a2f		* a2ft		
 642 | * a2p (.s)	* a2trp0 (.s)
 643 | * b2a		* b2tac (.s) 	* b2f (.s)	* b2ob		* ob2b
 644 | * b2tk
 645 | * c2b (.s)	* c2cc		* c2f (.s)	* c2h		* c2pc
 646 | * c2p (.s)	* c2P		* c2ta (.s)	* c2td
 647 | * p2a (.s)	* p2c (.s)	* P2c
 648 | * td2c		* td2f
 649 | ```
 650 | 
 651 | Special relationships (names do not correspond to keys):
 652 | ```
 653 | Versions T or U:
 654 | b2f[aA]  - blob to time, author, commit for the first commit creating that blob
 655 | b2tac  - blob to time, author, commit for all commits creating that blob
 656 | bb2cf  - result of diff on a commit: blob old blob, commit, file
 657 | obb2cf - see bb2cf but blobs reversed
 658 | c2fbb  - result of diff on a commit: commit file, blob, old blob
 659 | 
 660 | P2core - Project to devs who make 80+% of the commits
 661 | 
 662 | b2fLICENSE - grep for LICESNSE in b2f
 663 | bL2P - license blob to project
 664 | 
 665 | c2dat - full commit data in semicolon-separated fields
 666 | 
 667 | dl2Pf - API defined; language; project; file
 668 | ====
 669 | ```
 670 | 
 671 | Note: c2P returns the most central repository for this commit, and does not include repos that forked off of this commit. 
 672 |       P2c returns ALL commits associated with this repo, including commits made to forks of this particular repo. 
 673 | 	  The list of relationships is not exhaustive and more information can be found at https://github.com/woc-hack/tutorial/issues/17#issuecomment-850823408
 674 | 
 675 | ### Exercise 6
 676 | 
 677 | Find all blobs associated with Julia language files (extension .jl) 
 678 | 
 679 | Hint 1: What is the name of the map?
 680 | 
 681 | ```
 682 | [username@da0] zcat /da?_data/basemaps/gz/b2fFullU*.s | grep '\.jl;'
 683 | ```
 684 | 
 685 | ## Activity 7: Investigating Technical dependencies
 686 | 
 687 | The technical dependencies have been extracted by parsing the content of all blobs related to 
 688 | several different languages: and, for version V, are located in
 689 | `/da7_data/basemaps/gz/c2PtAbflPkgFullVX.s` with X ranging from 0
 690 | to 127 based on the 7 bits in the first byte of the commit sha1. 
 691 | 
 692 | 
 693 | The format of each file is encoded in its name: 
 694 | ```
 695 | commit;deforked repo;timestamp;Aliased author;blob;filename;language (as used in WoC);module1;module2;...`  
 696 | 
 697 | ```
 698 | for example
 699 | ```
 700 | 000000000fcd56c8536abd09cac5f2a54ba600c2;not-an-aardvark_lucky-commit;1510643197;Teddy Katz <teddykatz@fb.com>;d9730ab3fca05f4d61e7225de5731868cfb99fb6;lucky-commit.c;C;errno.h;string.h;math.h;zlib.h;stdio.h;sha.h;stdbool.h;stdlib.h;stat.h
 701 | ```
 702 | 
 703 | Unlike in version R where each language had a separate thruMaps
 704 | directory, info on all languages is kept in a single place. 
 705 | 
 706 | To identify the implementation of various packages one can use
 707 | `/da?_data/basemaps/gz/c2PtAbflDefFullUX.s` with X ranging from 0
 708 | to 127 based on the 7 bits in the first byte of the commit sha1. 
 709 | for example
 710 | ```
 711 | zcat /da?_data/basemaps/gz/c2PtAbflDefFullU0.s|head
 712 | 0000000000abc668c5388237320e97d0dadae7b1;not-an-aardvark_lucky-commit;1613716402;Teddy Katz <teddykatz@fb.com>;050e87971a0a069043821c8d5f0c55d1f4761edc;Cargo.toml;Rust;lucky_commit
 713 | 0000000000abc668c5388237320e97d0dadae7b1;not-an-aardvark_lucky-commit;1613716402;Teddy Katz <teddykatz@fb.com>;61aebdecc47b2b7521a353b1cc180b2af1080977;Cargo.lock;Rust;addr2line
 714 | ```
 715 | Instead of the list of dependencies the last field identifies the
 716 | package implemented within the blob, specifically, lucky_commit and
 717 | addr2line in the above two blobs.
 718 | 
 719 | The Def relationship in WoC tracks blobs that define a package based 
 720 | on the content of the source code. There is no guarantee that only one project will have it due to copying and other reasons. Identifying the which repository is the true upstream one may not be that difficult, however. 
 721 | 
 722 | Def relationship points only to blobs that define the package (e.g., blobs for file setup.py in Python, packages.json in JavaScript, etc.). This can be used to identify 
 723 | repositories (or parts of the repositories) where these package metafiles reside.   	
 724 | 	
 725 | *TODO*:put it into clickhouse to speed up access. 
 726 | 
 727 | Lets get a list of commits and repositories that imported Tensorflow for .py files:  
 728 | ```
 729 | [username@da0]~%zcat c2PtAbflPkgFullU76.s |grep tensorflow|head -2
 730 | 000005efe300482514d70d44c5fa922b34ff79a5;Rayhane-mamah_Tacotron-2;1557284915;qq443452099 <47710489+qq443452099@users.noreply.github.com>;05604b3f0632e98cc0eee3afef589dc5031f3a43;tacotron/synthesizer.py;PY;tacotron.utils.text.text_to_sequence;tacotron.utils.plot;tacotron.models.create_model;wave;datasets.audio;os;librosa.effects;tensorflow;infolog.log;datetime.datetime;io;numpy
 731 | 000005efe300482514d70d44c5fa922b34ff79a5;Rayhane-mamah_Tacotron-2;1557284915;qq443452099 <47710489+qq443452099@users.noreply.github.com>;49bc3b8b6533b93941223ccbeb401e47e5a573d7;hparams.py;PY;tensorflow;numpy
 732 | ```
 733 | 
 734 | ### Exercise 7
 735 | 
 736 | Find all repositories using Julia language that import package 'StaticArrays'
 737 | 
 738 | 
 739 | Hint 1: What file to look for?
 740 | ```
 741 | [username@da0]~% zcat /da?_data/basemaps/gz/c2PtabllfPkgFullS*.s | grep ';jl;' | grep StaticArrays
 742 | ```
 743 | 
 744 | Hint 2: What field contains the repository name?
 745 | ```
 746 | [username@da0]~% zcat /da?_data/basemaps/gz/c2PtabllfPkgFullS*.s | grep ';jl;'| grep StaticArrays | cut -d\; -f2 | sort -u
 747 | ```
 748 | 
 749 | ## Activity 8: Investigating Copy-Based Reuse
 750 | 
 751 | WoC's operationalization of copy-based supply chains is based on mapping blobs 
 752 | (versions of the source code) to all commits and projects where they have been created. 
 753 | For each blob, all commits are sorted based on their timestamp and the project in which
 754 | the first commit exists is identified as the originator and all other projects as the reuser 
 755 | of that blob. These files are located in `/da?_data/basemaps/gz/Ptb2PtFullVX.s` with X ranging from 0
 756 | to 127 based on the 7 bits in the first byte of the blob sha1. 
 757 | 
 758 | <div align="center">
 759 |     <img src="/Assets/reuse_DFD.png" alt="reuse_DFD">
 760 |     <p>Reuse Identification Data Flow Diagram</p>
 761 | </div>
 762 | 
 763 | The format of each file is encoded in its name: 
 764 | ```
 765 | originating repo;timestamp;blob;destination repo;timestamp  
 766 | 
 767 | ```
 768 | for example
 769 | ```
 770 | zhunengfei_ExtJS6.2-samples;1466402956;00000056a59bde3926f65c334caef688ccad0a08;bitbucket.org_mastercad_sencha_demo;1551632725
 771 | ```
 772 | This means that blob 00000056a59bde3926f65c334caef688ccad0a08 was first seen in zhunengfei_ExtJS6.2-samples at 1466402956
 773 | and was reused by bitbucket.org_mastercad_sencha_demo at 1551632725.
 774 | 
 775 | ## Activity 9: OSS License Identification
 776 | 
 777 | The proliferation of OSS has created a complex landscape of licensing practices, making accurate license identification essential for legal and compliance purposes. 
 778 | WoC uses a comprehensive approach, scanning all blobs with "license" in their filepath and applying the winnowing algorithm for reliable text matching against known licenses.
 779 | 
 780 | This method successfully identifies and matches over 5.5 million unique license blobs across projects, generating a detailed project to license map.
 781 | 
 782 | This map is stored at `/da?_data/basemaps/gz/P2LtFullV.s`.
 783 | 
 784 | The file format is encoded as follows:
 785 | ```
 786 | deforkedProject;License;time
 787 | ```
 788 | The "time" field is in the "YYYY-MM" format and represents the commit timestamp when the license blob was committed to the project. This field may also have an "invalid" value, indicating that the commit timestamp was not valid (e.g., a future time due to discrepancies in the user's system time).
 789 | 
 790 | Additionally, since these timestamps only represent when the license was committed to the project and do not indicate whether the license is still present, the latest commit tree (before the WoC version V data collection date, 2023-05) of each project was examined. If the license blob was found in the latest commit, a record was added with the time set as "latest."
 791 | 
 792 | When interpreting the data, it's important to note that the scope of license detection does not include code files or references to licenses within project documentation.
 793 | 
 794 | ## Activity 10: Suggested by the audience
 795 | 
 796 | Find all projects that have commits mentioning "sql injection"
 797 | 
 798 | List of commits is on /da4_data/All.blobs/
 799 | Lets login to da4, create a data folder to store temporary data on the same server
 800 | "/data/play/username", and uce pcommit to project map to get the list of projects.
 801 | 
 802 | ```
 803 | [username@da0]~% ssh da4
 804 | [username@da4]~% mkdir /data/play/audris
 805 | [username@da4]~% cd /data/play/audris
 806 | [username@da4:/data/play/audris]~% cut -d\; -f4 commit_*.idx | ~/lookup/showCmt.perl 2 | grep -i 'sql injection' > 
 807 | [username@da4:/data/play/audris]~% cut -d\; -f1 sql_inject | ~/lookup/getValues.perl /da0_data/basemaps/c2pFullP > sql_inject.c2p
 808 | ```
 809 | 
 810 | ## Activity 11: Summary of the activities undertaken
 811 | 
 812 | * Shell API (faster) and Python API (also Perl API not illustrated) for random access
 813 | 
 814 | * Sorted compressed tables for sweeps (grep)
 815 | 
 816 | * Key-Value maps to link authors, commits, files, projects, and blobs
 817 | 
 818 | * Overview of naming conventions server/data/databases
 819 | 
 820 | * Mongodb tables with the summary information about authors and projects to enable selection of subsets for later analysis: (e.g, I want authors with at least 100 commits who worked no less than three years and participated in at least five java projects.)
 821 | 
 822 | ### Summary Activity 11
 823 | 
 824 | * What type of usability improvements are needed?
 825 | 
 826 | * What types of tasks would you like to work during the hackathon?
 827 | 
 828 | * What would make you a long-time user of WoC?
 829 | 
 830 | 
 831 | ## Self paced part of the tutorial
 832 | 
 833 | The remainig activities are provided to illustrate various realistic 
 834 | tasks. 
 835 | 
 836 | ## Activity S0: Finding 1st-time imports for AI modules (Simple)
 837 | 
 838 | Given the data available, this is a fairly simple task. Making an application to detect the first time that a repo adopted an AI module would give you a better idea as to when it was first used, and also when it started to gain popularity.  
 839 | 
 840 | A good example of this is in
 841 | [popmods.py](https://github.com/ssc-oscar/aiframeworks/blob/master/popmods.py). In
 842 | this application, we can read all 128 c2PtabllfPkgFullS*.s
 843 | files and look for a given module with the earliest import times. The program then creates a <module_name>.first file, with each line formatted as `repo_name;UNIX_timestamp`.  
 844 | 
 845 | TODO: update popmods.py to work with c2PtabllfPkgFullS*.s
 846 | Usage: `[username@da0]~%  python popmods.py language_file_extension module_name`  
 847 | 
 848 | Before anything else (and this can be applied to many other
 849 | programs), you want to know what your input looks like ahead of time
 850 | and know how you are going to parse it.
 851 | Since each line of the file has this format:  
 852 | ```
 853 | commit;deforked repo;timestamp;author;blob;language (as used in WoC);language (as determined by ctags);filename;module1;module2;...
 854 | ```
 855 | 
 856 | We can use the `string.split()` method to turn this string into a list of words, split by a semicolon (;).  
 857 | By turning this line into a list, and giving it a variable name, `entry = ['commit', 'repo_name', 'timestamp', ...]`, we can then grab the pieces of information we need with `repo, time = entry[1], entry[2]`. 
 858 | 
 859 | An important idea to keep in mind is that we only want to count unique timestamps once. This is because we want to account for repositories that forked off of another repository with the exact timestamp of imports. An easy way to do this would be to keep a running list of the times we have come across, and if we have already seen that timestamp before, we will simply skip that line in the file:  
 860 | ```
 861 | ...
 862 | if time in times:
 863 | 	continue
 864 | else:
 865 | 	times.append(time)
 866 | ...
 867 | ```
 868 | We also want to find the earliest timestamp for a repository importing a given module. Again, this is fairly simple:  
 869 | ```
 870 | ...
 871 | if repo not in dict.keys() or time < dict[repo]:
 872 | 	for word in entry[5:]:
 873 | 		if module in word:
 874 | 			dict[repo] = time
 875 | 			break
 876 | ...
 877 | ```
 878 | #### Implementing the application
 879 | 
 880 | Now that we have the .first files put together, we can take this one step further and graph a modules first-time usage over time on a line graph, or even compare multiple modules to see how they stack up against each other. [modtrends.py](https://github.com/ssc-oscar/aiframeworks/blob/master/modtrends.py) accomplishes this by:  
 881 | 
 882 | * reading 1 or more .first files 
 883 | * converting each timestamp for each repository into a datetime date
 884 | * "rounding" those dates by year and month
 885 | * putting those dates in a dictionary with `dict["year-month"] += 1`
 886 | * graphing the dates and frequencies using matplotlib.
 887 | 
 888 | If you want to compare first-time usage over time for Tensorflow and
 889 | Keras for the .ipynb language .first files you created, run:
 890 | ```
 891 | UNIX> python3.6 modtrends.py tensorflow.first keras.first
 892 | ```
 893 | The final graph looks something like this:  
 894 | [![Tensorflow vs Keras](../ipynb_first/Tensorflow-vs-Keras.png "Tensorflow vs Keras")](https://github.com/ssc-oscar/aiframeworks/blob/master/charts/ipynb_charts/Tensorflow-vs-Keras.png)
 895 | 
 896 | 
 897 | 
 898 | ## Activity S1: Detecting percentage language use and changes over time (Complex) 
 899 | 
 900 | An application to calculate this would be useful for seeing how different authors changed languages over a range of years, based on the commits they have made to different files.  
 901 | In order to accomplish this task, we will modify an existing program from the swsc/lookup repo ([a2fBinSorted.perl](https://bitbucket.org/swsc/lookup/src/master/a2fBinSorted.perl)) and create a new program ([a2L.py](https://bitbucket.org/swsc/lookup/src/master/a2L.py)) that will get language counts per year per author.  
 902 | 
 903 | #### Part 1 -- Modifying a2fBinSorted.perl
 904 | For the first part, we look at what a2fBinSorted.perl currently does: it takes one of the 32 a2cFullP{0-31}.s files thru STDIN, opens the 32 c2fFullO.{0-31}.tch files for reading, and writes a corresponding a2fFullP.{0-31}.tch file based on the a2c file number. The lines of the file being `author_id;file1;file2;file3...`  
 905 | 
 906 | Example usage: `UNIX> zcat /da0_data/basemaps/gz/a2cFullP0.s | ./a2fBinSorted.perl 0`
 907 | 
 908 | We can modify this program so that it will write the earliest commit dates made by that author for those files, which will become useful for a2L.py later on. To accomplish this, we will have the program additionally read from the c2taFullP.{0-31}.tch files so we can get the time of each commit made by a given author:  
 909 | ```
 910 | my %c2ta;
 911 | for my $s (0..($sections-1)){
 912 | 	tie %{$c2ta{$s}}, "TokyoCabinet::HDB", "/fast/c2taFullP.$s.tch", TokyoCabinet::HDB::OREADER |
 913 | 	TokyoCabinet::HDB::ONOLCK,
 914 | 	   16777213, -1, -1, TokyoCabinet::TDB::TLARGE, 100000
 915 | 	or die "cant open fast/c2taFullP.$s.tch\n";
 916 | }
 917 | ```
 918 | We will also ensure the files to be written will have the relationship a2ft as oppposed to a2f:  
 919 | ```
 920 | my %a2ft;
 921 | tie %a2ft, "TokyoCabinet::HDB", "/data/play/dkennard/a2ftFullP.$part.tch", TokyoCabinet::HDB::OWRITER | 
 922 |      TokyoCabinet::HDB::OCREAT,
 923 | 	16777213, -1, -1, TokyoCabinet::TDB::TLARGE, 100000
 924 | 	or die "cant open /data/play/dkennard/a2ftFullP.$part.tch\n";
 925 | ```
 926 | Another important part of the file we want to change is inside the `output` function:  
 927 | ```
 928 | sub output {
 929 | 	my $a = $_[0];
 930 | 	my %fs;
 931 | 	for my $c (@cs){
 932 | 		my $sec =  segB ($c, $sections);
 933 | 		if (defined $c2f{$sec}{$c} and defined $c2ta{$sec}{$c}){
 934 | 			my @fs = split(/\;/, safeDecomp ($c2f{$sec}{$c}, $a), -1);
 935 | 			my ($time, $au) = split(/\;/, $c2ta{$sec}{$c}, -1);  #add this for grabbing the time
 936 | 			for my $f (@fs){
 937 | 				if (defined $time and (!defined $fs{$f} or $time < $fs{$f})){ #modify condition to grab earliest time
 938 | 					$fs{$f} = $time;
 939 | 				}
 940 | 			}
 941 | 		}
 942 | 	}
 943 | 	$a2ft{$a} = safeComp (join ';', %fs); #changed
 944 | }
 945 | ```
 946 | Now when we run the new program, it should write individual a2ftFullP.{0-31}.tch files with the format:  
 947 | `author_id;file1;file1_timestamp;file2;file2_timestamp;...`  
 948 | 
 949 | We can then create a new PATHS dictionary entry in oscar.py, as well as making another function under the Author class to read our newly-created .tch files:  
 950 | ```
 951 | In PATHS dictionary:
 952 | ...
 953 | 'author_file_times': ('/data/play/dkennard/a2ftFullP.{key}.tch', 5)
 954 | ...
 955 | 
 956 | In class Author(_Base):
 957 | ...
 958 | @cached_property
 959 | def file_times(self):
 960 | 	data = decomp(self.read_tch('author_file_times'))
 961 | 	return tuple(file for file in (data and data.split(";")))
 962 | ...
 963 | ```
 964 | 
 965 | #### Part 2 -- Creating a2L.py
 966 | Our next task involves creating a2LFullP{0-31}.s files utilizing the new .tch files we have just created. We want these files to have each line filled with the author name, each year, and the language counts for each year. A possible format could look something like this:  
 967 | `"tim.bentley@gmail.com" <>;year2015;2;py;31;js;30;year2016;1;py;29;year2017;8;c;2;doc;1;py;386;html;6;sh;1;js;3;other;3;build;1`  
 968 | where the number after each year represents the number of languages used for that year, followed by pairs of languages and the number of files written in that language for that year. As an example, in the year 2015, Tim Bentley made initial commits to files in 2 languages, 31 of which were in Python, and 30 of which were in JavaScript.  
 969 | 
 970 | There is a number of things that have to happen to get to this point, so lets break it down:  
 971 | 
 972 | * Iterating Author().file_times and grouping timestamps into year 
 973 | 
 974 | We will start by reading in a a2cFullP{0-31}.s file to get a list of authors, which we then hold as a tuple in memory and start building our dictionary:  
 975 | ```
 976 | a2L[author] = {}
 977 | file_times = Author(author).file_times
 978 | for j in range(0,len(file_times),2):
 979 | 	try:
 980 | 		year = str(datetime.fromtimestamp(float(file_times[j+1]))).split(" ")[0].split("-")[0]
 981 | 	#have to skip years either in the 20th century or somewhere far in the future
 982 | 	except ValueError:
 983 | 		continue
 984 | 	#in case the last file listed doesnt have a time
 985 | 	except IndexError:
 986 | 		break
 987 | 	year = specifier + year #specifier is the string 'year'
 988 | 	if year not in a2L[author]:
 989 | 		a2L[author][year] = []
 990 | 	a2L[author][year].append(file_times[j])
 991 | ```
 992 | The datetime.fromtimestamp() function will turn this into a datetime format: `year-month-day hour-min-sec` which we split by a space to get the first half `year-month-day` of the string, and then split again to get `year`.  
 993 | 
 994 | * Detecting the language of a file based on file extension
 995 | ```
 996 | for year, files in a2L[author].items():
 997 | 	build_list = []
 998 | 	for file in files:
 999 | 		la = "other"
1000 | 		if re.search("\.(js|iced|liticed|iced.md|coffee|litcoffee|coffee.md|ts|cs|ls|es6|es|jsx|sjs|co|eg|json|json.ls|json5)$",file):
1001 | 			la = "js"
1002 | 		elif re.search("\.(py|py3|pyx|pyo|pyw|pyc|whl|ipynb)$",file):
1003 | 			la = "py"
1004 | 		elif re.search("(\.[Cch]|\.cpp|\.hh|\.cc|\.hpp|\.cxx)$",file):
1005 | 			la = "c"
1006 | 	.......
1007 | ```
1008 | The simplest way to check for a language based on a file extension is to use the re module for regular expressions. If a given file matches a certain expression, like `.py`, then that file was written in Python. `la = other` if no matches were found in any of those searches. We then keep track of these languages and put each language in a list `build_list.append(la)`, and count how many of those languages occurred when we looped through the files `build_list.count(lang)`. The final format for an author in the a2L dictionary will be `a2L[author][year][lang] = lang_count`.  
1009 | 
1010 | * Writing each authors information into the file
1011 | 
1012 | See [a2L.py](https://bitbucket.org/swsc/lookup/src/master/a2L.py) for how information is written into each file.  
1013 | 
1014 | Usage: `UNIX> python a2L.py 2` for writing `a2LFullP2.s`  
1015 | 
1016 | #### Implementing the application
1017 | Now that we have our a2L files, we can run some interesting statistics as to how significant language usage changes over time for different authors. The program [langtrend.py](https://bitbucket.org/swsc/lookup/src/master/langtrend.py) runs the chi-squared contingency test (via the stats.chi2_contingency() function from scipy module) for authors from an a2LFullP{0-31}.s file and calculates a p-value for each pair of years for each language for each author.   
1018 | This p-value means the percentage chance that you would find another person (say out of 1000 people) that has this same extreme of change in language use, whether that be an increase or a decrease. For example, if a given author editied 300 different Python files in 2006, but then editied 500 different Java files in 2007, the percentage chance that you would see this extreme of a change in another author is very low. In fact, if this p-value is less than 0.001, then the change in language use between a pair of years is considered "significant".  
1019 | 
1020 | In order for this p-value to be a more accurate approximation, we need a larger sample size of language counts. When reading the a2LFullP{0-31}.s files, you may want to rule out people who dont meet certain criteria:  
1021 | 
1022 | * the author has at least 5 consecutive years of commits for files
1023 | * the author has edited at least 100 different files for all of their years of commits
1024 | 
1025 | If an author does not meet this criteria, we would not want to consider them for the chi-squared test simply because their results would be "uninteresting" and not worth investigating any further.  
1026 | 
1027 | Here is one of the authors from the programs output:  
1028 | ```
1029 | ----------------------------------
1030 | Ben Niemann <pink@odahoda.de>
1031 | { '2015': {'doc': 3, 'markup': 2, 'obj': 1, 'other': 67, 'py': 127, 'sh': 1},
1032 | 	'2016': {'doc': 1, 'other': 23, 'py': 163},
1033 | 	'2017': {'build': 36, 'c': 116, 'lsp': 1, 'other': 81, 'py': 160},
1034 | 	'2018': { 'build': 12,
1035 | 		'c': 134,
1036 | 		'lsp': 2,
1037 | 		'markup': 2,
1038 | 		'other': 133,
1039 | 		'py': 182},
1040 | 	'2019': { 'build': 13,
1041 | 		'c': 30,
1042 | 		'doc': 8,
1043 | 		'html': 10,
1044 | 		'js': 1,
1045 | 		'lsp': 2,
1046 | 		'markup': 16,
1047 | 		'other': 67,
1048 | 		'py': 134}}
1049 | 	pfactors for obj language
1050 | 		2015--2016 pfactor == 0.9711606775110577  no change
1051 | 	pfactors for doc language
1052 | 		2015--2016 pfactor == 0.6669499228133753  no change
1053 | 		2016--2017 pfactor == 0.7027338745275937  no change
1054 | 		2018--2019 pfactor == 0.0009971248193242038  rise/drop
1055 | 	pfactors for markup language
1056 | 		2015--2016 pfactor == 0.5104066960256399  no change
1057 | 		2017--2018 pfactor == 0.5532258789014389  no change
1058 | 		2018--2019 pfactor == 1.756929555308731e-05  rise/drop
1059 | 	pfactors for py language
1060 | 		2015--2016 pfactor == 1.0629725495084215e-07  rise/drop
1061 | 		2016--2017 pfactor == 1.2847558344252341e-25  rise/drop
1062 | 		2017--2018 pfactor == 0.7125543569718793  no change
1063 | 		2018--2019 pfactor == 0.026914075872778477  no change
1064 | 	pfactors for sh language
1065 | 		2015--2016 pfactor == 0.9711606775110577  no change
1066 | 	pfactors for other language
1067 | 		2015--2016 pfactor == 1.7143130378377696e-06  rise/drop
1068 | 		2016--2017 pfactor == 0.020874234589765908  no change
1069 | 		2017--2018 pfactor == 0.008365948846657284  no change
1070 | 		2018--2019 pfactor == 0.1813919210757513  no change
1071 | 	pfactors for c language
1072 | 		2016--2017 pfactor == 2.770649054044977e-16  rise/drop
1073 | 		2017--2018 pfactor == 0.9002187643203734  no change
1074 | 		2018--2019 pfactor == 1.1559110387953382e-08  rise/drop
1075 | 	pfactors for lsp language
1076 | 		2016--2017 pfactor == 0.7027338745275937  no change
1077 | 		2017--2018 pfactor == 0.8855759560371912  no change
1078 | 		2018--2019 pfactor == 0.9944669523033288  no change
1079 | 	pfactors for build language
1080 | 		2016--2017 pfactor == 4.431916568235125e-05  rise/drop
1081 | 		2017--2018 pfactor == 5.8273175348446296e-05  rise/drop
1082 | 		2018--2019 pfactor == 0.1955154860787908  no change
1083 | 	pfactors for html language
1084 | 		2018--2019 pfactor == 0.0001652525618661536  rise/drop
1085 | 	pfactors for js language
1086 | 		2018--2019 pfactor == 0.7989681687355706  no change
1087 | ----------------------------------
1088 | ```  
1089 | 
1090 | Although it is currently not implemented, one could take this one step further and visually represent an authors language changes on a graph, which would be simpler to interpret as opposed to viewing a long list of p values such as the one shown above.  
1091 | 
1092 | ## Activity S2: Useful Python imports for applications
1093 | ### subprocess
1094 | Simlar to the C version, system(), this module allows you to run UNIX processes, and also allows you to gather any input, output, or error from those processes, all from within a Python script. This module becomes especially useful when you are looking for specific lines out of a .s/.gz file, as opposed to reading the entire file which takes more time.  
1095 | A good example usage for subprocess is when we read the c2bPtaPkgO$LANG.{0-31}.gz files for first-time AI module imports in popmods.py. Rather than reading one of these files in its entirety, we look for lines of the file that have a specific module we are looking for.  
1096 | ```
1097 | for i in range(32):
1098 | 	print("Reading gz file number " + str(i))
1099 | 	command = "zcat /data/play/" + dir_lang + "thruMaps/c2bPtaPkgO" + dir_lang + "." + str(i) + ".gz"
1100 | 	p1 = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
1101 | 	p2 = subprocess.Popen("egrep -w " + module, shell=True, stdin=p1.stdout, stdout=subprocess.PIPE) 
1102 | 	output = p2.communicate()[0]
1103 | ```
1104 | We can then iterate the lines of this output accordingly, and gather the pieces of information we need:  
1105 | ```
1106 | for entry in str(output).rstrip('\n').split("\n"):
1107 | 	entry = str(entry).split(";")
1108 | 	repo, time = entry[1], entry[2]
1109 | ```
1110 | 
1111 | Additional documentation on subprocess can be found [here](https://docs.python.org/2/library/subprocess.html).  
1112 | 
1113 | ----------
1114 | ### re
1115 | The re (Regular Expression) module is another useful import for pattern-matching in strings.  
1116 | Addtional re documentation can be found [here](https://docs.python.org/2/library/re.html).  
1117 | 
1118 | ----------
1119 | ### matplotlib
1120 | Useful graphing module for creating visual representations.  
1121 | Extensive documentation can be found [here](https://matplotlib.org/).  
1122 | 
1123 | ## Activity S3: a comparison of oscar.py vs. Perl scripts
1124 | 
1125 | When it comes to creating new relationship files (.tch/.s files), using Perl over Python for large data-reading is more time-saving overall. This situation occurred in the complex application we covered where we modified an existing Perl file to get the initial commit times of each file for each author, rather than using Python to accomplish this task.  
1126 | Before making this decision, one of our team members decided to run a test between 2 programs, [a2ft.py](https://bitbucket.org/swsc/lookup/src/master/a2ft.py) and [a2ft.perl](https://bitbucket.org/swsc/lookup/src/master/a2ft.perl). These programs were run at the same time for a period of 10 minutes. Both programs had the same task of retrieving the earliest commit times for each file under each author from a2cFullP{0-31}.s files. The Python version calls the `Commit_info().time_author` and `Commit().changed_file_names` functions from oscar.py. The Perl version ties each of the 32 c2fFullO.{0-31}.tch (Commit().changed_file_names) and c2taFullP.{0-31}.tch (Commit_info().time_author) files into 2 different Perl hashes (Python dictionary equivalent), %c2f and %c2ta. The speed difference between Perl and Python was quite surprising:  
1127 | 
1128 | ```
1129 | [username@da3]/data/play/dkennard% ll a2ftFullP0TEST1.s
1130 | -rw-rw-r--. 1 dkennard dkennard 980606 Jul 22 11:56 a2ftFullP0TEST1.s
1131 | [username@da3]/data/play/dkennard% ll a2ftFullPTEST2.0.tch
1132 | -rw-r--r--. 1 dkennard dkennard 663563424 Jul 22 11:56 a2ftFullPTEST2.0.tch
1133 | ```  
1134 | 
1135 | Within this 10 minute period, the Python version only wrote 980,606 bytes of data into the TEST1 file shown above, whereas the Perl version wrote 663,563,424 bytes into the TEST2 file.  
1136 | The main reason oscar.py is slower, in theory, is because oscar.py has more private function calls that it has to perform in order to calculate the key (0-31) and locate where the requested information is stored. Upon further inspection of the [oscar.py](https://github.com/ssc-oscar/oscar.py/blob/master/oscar.py) functions that are called, we can see that there are between 6-7 function calls for each lookup. All of these function calls cause function overhead and thus increase the amount of time to retrieve data for multiple entities.  
1137 | In the Perl version of a2ft, the program simply calls `segB()`, which calculates the key of where the information is stored. The function takes a string and the number 32 as arguments (ex. segB(commit_sha, 32)):   
1138 | 
1139 | ```
1140 | sub segB {
1141 | 	my ($s, $n) = @_;
1142 | 	return (unpack "C", substr ($s, 0, 1))%$n;
1143 | }
1144 | ```  
1145 | 
1146 | Because the %c2f and %c2ta Perl hashes are tied to their respective .tch files, we can then check if a specific commit in a specific number section is defined:  
1147 | 
1148 | ```
1149 | for my $c (@cs){	#where cs is a list of commits for an author and c is one of those commits
1150 | 	my $sec =  segB ($c, $sections);
1151 | 	if (defined $c2f{$sec}{$c} and defined $c2ta{$sec}{$c}){
1152 | 		...
1153 | 	}
1154 | 	...
1155 | }
1156 | ```
1157 | 
1158 | This is not to say that oscar.py is inefficient and should not be utilized, but it is not the optimal solution for creating new .tch or .s relationship files. oscar.py solely provides a Python interface for gathering requested data out of the respective .tch files and not for mass-reading all 32 files. It also provides simple function calls that were mentioned earlier in the tutorial for retrieving bits of information at a time in a more convenient way.
1159 | 
1160 | 
1161 | ## Activity S4: Plumbing of WoC
1162 | 
1163 | We can obtain a diff for any commit. It requires comparing trees of it and its parent:
1164 | 
1165 | Lets find the diff for 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a: 
1166 | ```
1167 | [username@da0]~% echo 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a | ~/lookup/cmputeDiff 
1168 | 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a;/sys/dev/pccbb/pccbb_isa.c;9d5818e25865797b96e4783b00b45f800423e527;594dc8cb2ce725658377bf09aa0f127183b89f77
1169 | 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a;/sys/dev/pccbb/pccbb_pci.c;b3c1363c90de7823ec87004fe084f41d0f224c9b;4155935a98ba3b5d3786fa1b6d3d5aa52c6de90a
1170 | ```
1171 | We can see it modifying two files.
1172 | 
1173 | ### Exercise
1174 | 
1175 | Calculate the change made to /sys/dev/pccbb/pccbb_isa.c by commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a.
1176 | 
1177 | Hint1: Get the old and new blob for  /sys/dev/pccbb/pccbb_isa.c
1178 | 
1179 | Hint2: Use shell redirect output '>' to save the content of each blob
1180 | 
1181 | Hint3: Use unix diff to calculate the difference
1182 | ```
1183 | [username@da0]~% diff old new
1184 | ```
1185 | 
1186 | ## Iterating over a dataset
1187 | 
1188 | Sometimes, iterating over the entire dataset using the already created basemaps is the only way to retrieve the desired information. The basemappings from one datatype to another are key-value pairings of data. As such, the retrieval of the entire dataset can usually be done in one pass over one of the already created basemaps.
1189 | 
1190 | For example, if the goal was to determine information pertaining to each author in WoC, simply iterating over one of the many basemaps from author to some other dataset e.g. (a2b, a2c, etc) will serve. Since these datasets are a key-value mapping from author to another dataset, this guarantees that each of the keys will be one of the unique authors in WoC. From there, the desired information about that specific author can be determined. 
1191 | 
1192 | Below is a Perl script template that allows for retrieval of all the authors from a2c. 
1193 | 
1194 | -----------------------
1195 | ```perl
1196 | #!/usr/bin/perl -I /home/audris/lookup -I /home/audris/lib64/perl5 -I /home/audris/lib/x86_64-linux-gnu/perl -I /usr/local/lib64/perl5 -I /usr/local/share/perl5
1197 | use strict;
1198 | use warnings;
1199 | use Error qw(:try);
1200 | 
1201 | use TokyoCabinet;
1202 | 
1203 | my $split = 32;
1204 | my %a2cF = (); 
1205 | for my $sec (0..($split-1)){   
1206 |   my $fname = "/da1_data/basemaps/a2cFullS.$sec.tch";   
1207 |   tie %{$a2cF{$sec}}, "TokyoCabinet::HDB", "$fname", TokyoCabinet::HDB::OREADER | TokyoCabinet::HDB::ONOLCK,  16777213, -1, -1, TokyoCabinet::TDB::TLARGE, 100000 
1208 |   or die "cant open $fname\n"; 
1209 | }
1210 | while (my ($i, $author) = each %a2cF) {
1211 |   my $v1 = join ";", sort keys %{$author};
1212 |   my @apl = split(';', $v1);
1213 |   for my $a (@apl) {
1214 |     print "$a\n";
1215 |   }
1216 | }
1217 | 
1218 | ```
1219 | ---------------
1220 | This script simply prints each WoC authors name. This helps illustrate how to go about retrieving the keys in a key-value basemap using Perl, but lacks any practical use on it's own.
1221 | 
1222 | Notice in this script the $split is defined to be 32 and the for loop iterates from 0 to 31. The reason for this is because of how the data is stored in the basemaps. Each basemap from one data type to another is split into 32 roughly equal parts based on their hashes. As such, in order to iterate over the entire data set, it is neccesary to look at each of these files separately.
1223 | 
1224 | From there, Perl allows for direct tying to each of these files in the format of a hash. Because the basemappings are saved using TokyoCabinet, it requires them to be opened using TokyoCabinet to retrieve the data.
1225 | 
1226 | Once the hash is tied to the mapping, iterating over the hash can be done, and retrieval of the information simply becomes accessing the elements.
1227 | 
1228 | ## Mongo Database
1229 | 
1230 | On the da1 server, there is a MongoDB server holding some relevant
1231 | data. This data includes some information that was used for data
1232 | analysis in the past. Mongo provides an excellent place to store
1233 | relatively small data without requiring relational information. 
1234 | 
1235 | Two collections the WoC database cand be helpful for sampling
1236 | projects and authors A_metadata.V and P.metadata.V where V
1237 | represents the version (e.g., T) , A stands for aliased author id
1238 | and P for deforked repository name. 
1239 | 
1240 | ### MongoDB Access
1241 | 
1242 | When on the da1 server, you can gain access to the MongoDB server simply by running the command 'mongo', or, when on any other da server, you can gain access by running 'mongo --host "da1.eecs.utk.edu"'.
1243 | 
1244 | Once on the server, you can see all the available databases using the "show dbs" command. However, the database that pertains primarily to the WoC is the WoC database. 
1245 | 
1246 | Most databases are used for teaching and other tasks, spo please use
1247 | WoC database using the 'use "database name"' command, E.G. (use WoC), and, after switching, you can view the available collections in the database by using the 'show collections' command. 
1248 | 	
1249 | Currently, there is an author metadata collection (A_metadata.U)
1250 | that contains basic stats: the total number of projects, 
1251 | 	the total number of blobs created by them (before
1252 | anyone else), the total number
1253 | of commits made, the total number of files they have modified, the
1254 | distribution of language files modified by that author, 
1255 | and the first and last time the author committed in Unix Timestamp based on the
1256 | data contained on version S of WoC. Author names have ben aliased
1257 | and the number of aliases and the list are also included in the record.
1258 | Furthermore, up to 100 most commonly used API (packages) in author modified files are 
1259 | also included. 
1260 | 	
1261 | Alongside this, there is a similar collection for projects on WoC
1262 | (P_metadata.U) that contains the total number of authors on the
1263 | project, the total number of commits, the total number of files, the
1264 | distribution of languages used, the first and last time there was a
1265 | commit to the project in Unix Timestamp based on the version U of
1266 | WoC. Since the project is deforked, the community size (the number
1267 | of other projects that share commits with the deforeked project) is
1268 | also provided. WoC relation P2p can be used to list other projects
1269 | linked to it.  We also provide additional info based on linking to
1270 | attributes that exist only in GitHub. That data is not as recent,
1271 | however and more work is needed to make it complete. These
1272 | attributes include the number of stars GitHub has given that project
1273 | (if any), if the project is a GitHub fork, and where it was forked
1274 | from (if anywhere).
1275 | 
1276 | Finally the collection of APIs or packages contains summary of the first and last time the package was used in a modified file as well as the number of commits, authors and repositories associated with the use of that package.
1277 | 	
1278 | To see data in one of the collections, you can run the 'db."collection name".findOne()' command. This will show the first element in the collection and should help clarify what is in the collection.
1279 | 
1280 | 
1281 | When the above findOne() command is run on the A_metadata.U
1282 | collection, the output is as follows: (in this case we look only for items with more than 200 commits:
1283 | 
1284 | -----------
1285 | ```
1286 | mongosh --host=da1
1287 | mongosh>use WoC;
1288 | WoC> db.A_metadata.U.findOne({NumCommits:{$gt:200}})
1289 | { 
1290 |   _id: ObjectId("62229b132bc6e5f0dbd0307f"),
1291 |   FileInfo: {
1292 |     Ruby: 2,
1293 |     TypeScript: 1,
1294 |     Python: 2,
1295 |     Rust: 92,
1296 |     PHP: 6895,
1297 |     other: 2406,
1298 |     Sql: 1,
1299 |     JavaScript: 1384,
1300 |     'C/C++': 6,
1301 |     Java: 1
1302 |   },
1303 |   NumActiveMon: 19,
1304 |   EarliestCommitDate: 1512136069,
1305 |   ApiInfo: {
1306 |     'Rust:the': 1,
1307 |     'PY:sys': 1,
1308 |     'PY:datetime': 1,
1309 |     'Rust:on': 1,
1310 |     'PY:os': 1
1311 |   },
1312 |   LatestCommitDate: 1550574037,
1313 |   MonNprj: {
1314 |     '2019-02': 1,
1315 |     '2017-11': 5,
1316 |     '2018-04': 2,
1317 |     '2018-12': 2,
1318 |     '2018-03': 2,
1319 |     '2019-05': 1,
1320 |     '2019-04': 1,
1321 |     '2019-03': 1,
1322 |     '2018-05': 2,
1323 |     '2018-02': 3,
1324 |     '2018-08': 1,
1325 |     '2018-01': 1,
1326 |     '2019-06': 1,
1327 |     '2017-12': 4,
1328 |     '2018-10': 1,
1329 |     '2018-06': 1,
1330 |     '2018-11': 1,
1331 |     '2018-07': 1,
1332 |     '2019-01': 4
1333 |   },
1334 |   NumOriginatingBlobs: 2187,
1335 |   AuthorID: 'Jennifer Calipel <jennifer.calipel@gmail.com>',
1336 |   MonNcmt: {
1337 |     '2019-02': 9,
1338 |     '2017-11': 21,
1339 |     '2018-04': 13,
1340 |     '2018-12': 2,
1341 |     '2018-03': 29,
1342 |     '2019-05': 1,
1343 |     '2019-04': 2,
1344 |     '2019-03': 5,
1345 |     '2018-05': 9,
1346 |     '2018-02': 32,
1347 |     '2018-08': 42,
1348 |     '2018-01': 6,
1349 |     '2019-06': 2,
1350 |     '2017-12': 23,
1351 |     '2018-10': 1,
1352 |     '2018-06': 5,
1353 |     '2018-11': 1,
1354 |     '2018-07': 12,
1355 |     '2019-01': 17
1356 |   },
1357 |   NumCommits: 232,
1358 |   NumProjects: 18,
1359 |   NumFiles: 10790
1360 | }
1361 | ```
1362 | Similarly for projects:
1363 | ```
1364 |  WoC> db.P_metadata.U.findOne({NumCommits:{$gt:200}})
1365 | { 
1366 |   _id: ObjectId("62228cb7e65a0aefc2ca086f"),
1367 |   FileInfo: { other: 442, JavaScript: 17 },
1368 |   NumAuthors: 11,
1369 |   MonNauth: {
1370 |     '2020-04': 9,
1371 |     '2021-07': 1,
1372 |     '2020-11': 1,
1373 |     '2020-07': 1,
1374 |     '2021-05': 1,
1375 |     '2021-08': 1,
1376 |     '2020-08': 1,
1377 |     '2020-03': 5,
1378 |     '2020-06': 5,
1379 |     '2020-12': 1,
1380 |     '2020-05': 5
1381 |   },
1382 |   EarliestCommitDate: 1584055325,
1383 |   NumStars: 17,
1384 |   NumBlobs: 709,
1385 |   LatestCommitDate: 1628739252,
1386 |   ProjectID: 'foss-responders_fossresponders.com',
1387 |   MonNcmt: {
1388 |     '2020-04': 60,
1389 |     '2021-07': 2,
1390 |     '2020-11': 1,
1391 |     '2020-07': 1,
1392 |     '2021-05': 2,
1393 |     '2021-08': 2,
1394 |     '2020-08': 4,
1395 |     '2020-03': 48,
1396 |     '2020-06': 16,
1397 |     '2020-12': 3,
1398 |     '2020-05': 91
1399 |   },
1400 |   NumCore: 3,
1401 |   NumCommits: 230,
1402 |   CommunitySize: 1,
1403 |   NumFiles: 459,
1404 |   Core: {
1405 |     'SaptakS <saptak013@gmail.com>': '23',
1406 |     'Awele <awele.osuka@gmail.com>': '11',
1407 |     'Richard Littauer <richard@maintainer.io>': '157'
1408 |   },
1409 |   NumForks: 11
1410 | }
1411 | ```
1412 | And for APIs
1413 | ```
1414 | mongosh> WoC> db.API_metadata.U.findOne({$and: [ { NumCommits:{$gt:200} }, { NumProjects: {$gt:200} }, {NumAuthors:{$gt:200}} ]})
1415 | {
1416 |   _id: ObjectId("62257192758fdfbec79e9125"),
1417 |   NumAuthors: 456,
1418 |   Lang: 'C',
1419 |   NumProjects: 236,
1420 |   NumCommits: 4366,
1421 |   API: 'C:BasicUsageEnvironment.hh'
1422 | }
1423 | ```
1424 | ---------------
1425 | 
1426 | This metadata can then be parsed for the desired information.
1427 | 
1428 | Python, like most other programming languages, has an interface with
1429 | Mongo that makes for data storage/retrieval much simpler. When
1430 | retrieving or inputting large amounts of data onto the servers, it
1431 | is almost always faster and easier to do so through one of the
1432 | interfaces provided. 
1433 | 
1434 | 
1435 | ### PyMongo
1436 | 
1437 | PyMongo is an import for Python that simplifies access to the database and elements inside of it. When accessing the server you must first provide which Mongo Client you wish to connect to. For our server, the host will be "mongodb://da1.eecs.utk.edu/". 
1438 | This will allow access to the data already saved and will allow for creation of new data if desired. 
1439 | 
1440 | From there, accessing databases inside of the client becomes as simple as treating the desired database as an element inside the client. The same is true for accessing collections inside of a database. 
1441 | The below code illustrates this process.
1442 | 
1443 | --------
1444 | ```python3
1445 | import pymongo
1446 | client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")
1447 | 
1448 | db = client["WoC"]                                                    
1449 | coll = db["A_metadata.U"]
1450 | ```
1451 | -------
1452 | 
1453 | #### Data Retrieval using PyMongo
1454 | 
1455 | When attempting to retrieve data, iterating over the entire collection for specific info is often neccesary. This is done most often through a mongo specific data structure called cursors. However, cursors have a limited life span. After roughly 10 minutes of continuous connection to the server, the cursor is forcibly disconnected. This is to limit the possible number of idle cursors connected to the server at any time. 
1456 | 
1457 | Taking this into consideration, if the process may take longer than that, it is neccesary to define the cursor as undying. If this is neccesary, manual disconnection of the cursor after it's served it's purpose is required as well.
1458 | The below code illustrates creation and iteration over the collection with a cursor.
1459 | 
1460 | --------
1461 | ```python
1462 | client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")
1463 | 
1464 | db = client["WoC"]                                                    
1465 | coll = db["A_metadata.U"]
1466 | 
1467 | dataset = col.find({}, cursor_no_timeout=True)
1468 | for data in dataset:
1469 |    ...
1470 | 
1471 | dataset.close()
1472 | ```
1473 | -------
1474 | 
1475 | Once data retrieval has begun, accessing the specific information desired is simple. 
1476 | For example, provided above is the information saved in one element
1477 | of auth_metadata. If access to the AuthorID of each cursor is
1478 | desired, the "AuthorID" can be treated as the key in a key
1479 | value-mapping. However, it is often neccesary to consider how the
1480 | data is stored. 
1481 | 
1482 | Most often, when storing data in Mongo, it will be stored in Mongo
1483 | specific format called BSON. BSON objects are saved in
1484 | unicode. Working with unicode can be an issue if printing needs to
1485 | be done. As such, decoding from unicode must to be done. Below
1486 | illustrates a small program that prints each AuthorID from the
1487 | auth_metadata collection. 
1488 | 
1489 | ----------
1490 | ```python
1491 | import pymongo
1492 | import bson
1493 | 
1494 | client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")
1495 | db = client ['WoC']
1496 | coll = db['A_metadata.U']
1497 | 
1498 | dataset = coll.find({}, no_cursor_timeout=True)
1499 | for data in dataset:
1500 |     a = data["AuthorID"].encode('utf-8').strip()
1501 |     print(a)
1502 | 
1503 | dataset.close()
1504 | 
1505 | ```
1506 | ----------
1507 | 
1508 | When retrieving data, it is often neccesary to narrow the
1509 | results. This is possible directly through Mongo when querying for
1510 | information. For instance, if all the data is not needed in the
1511 | auth_metadata, just the NumCommits and the AuthorID, the query can
1512 | be restricted adding parameters to the find call. An example query
1513 | is provided below. 
1514 | 
1515 | ----------
1516 | ```python
1517 | dataset = coll.find({}, {"AuthorID": 1, "NumCommits": 1, "_id": 0}, no_cursor_timeout=True)
1518 | 
1519 | for data in dataset:
1520 |     print(data)
1521 |     
1522 | dataset.close()
1523 | ```
1524 | ---------
1525 | 
1526 | This specific call allows for direct printing of the data, however, as noted above, the names are saved in BSON and as such will be printed in unicode. The first 10 results are shown below.
1527 | 
1528 | -------------
1529 | ```
1530 | {u'NumCommits': 1, u'AuthorID': u'  <mvivekananda@virtusa.com>'}
1531 | {u'NumCommits': 0, u'AuthorID': u' <1151643598@163.com>'}
1532 | ...
1533 | ```
1534 | --------------
1535 | 
1536 | Sometimes, restricting the data even further is neccesary. Notice above that many of the users have 0 commits. Exclusion of these entries may be desired. The below example illustrates a way to restrict the results to only users with greater than 0 commits.
1537 | 
1538 | ----------
1539 | ```python
1540 | dataset = coll.find({"NumCommits : { "$gt" : 0 } }, 
1541 | 				     {"AuthorID": 1, "NumCommits": 1, "_id": 0}, 
1542 | 				     no_cursor_timeout=True)
1543 | 
1544 | for data in dataset:
1545 |     print(data)
1546 | ```
1547 | ---------
1548 | 
1549 | ## Accesing by time slices
1550 | 
1551 | To access collection indexed by time we use clickhouse databse:
1552 | https://clickhouse.yandex/docs/en/getting_started/tutorial/
1553 | 
1554 | 
1555 | It has interfaces to various languages but the key is super fast indexing and the ability to distribute data over a cluster of da servers.
1556 | 
1557 | Since only commits have time associated with them, we start from storing all commits in the database. We store only a subset of commits on each server, first we create a table local to each server (commits) and a table that represents all servers (commits_a):
1558 | ```
1559 | for h in da0 da1 da2 da3 da4; 
1560 | do echo "CREATE TABLE api_$v (api String, from Int32, to Int32, ncmt Int32, nauth Int32, nproj Int32) ENGINE = MergeTree() ORDER BY from" |clickhouse-client --host=$h
1561 |    echo "CREATE TABLE api_all AS api_$v ENGINE = Distributed(da, default, api_$v, rand())" | clickhouse-client --host=$h
1562 | 
1563 |   echo "CREATE TABLE commit_$v (sha1 FixedString(20), time Int32, tc Int32, tree FixedString(20), parent String, taz String, tcz String, author String, commiter String, project String, comment String) ENGINE = MergeTree() ORDER BY time" |clickhouse-client --host=$h
1564 |   echo "CREATE TABLE commit_all AS commit_$v ENGINE = Distributed(da, default, commit_$v, rand())" | clickhouse-client --host=$h
1565 | done
1566 | ```
1567 | 
1568 | Then we import data into each of these five tables:
1569 | ```bash
1570 | v=u
1571 | for j in {0..4}
1572 | do da=da$j
1573 |   for i in $(eval echo "{$j..31..5}")
1574 |   do echo "start inserting $da file $i"
1575 |     time /da?_data/basemaps/gz/Pkg2stat$i.gz | ~/lookup/chImportPkg.perl | clickhouse-client --max_partitions_per_insert_block=1000 --host=$da --query 'INSERT INTO api_u FORMAT RowBinary'
1576 |   done
1577 |   for i in $(eval echo "{$j..127..5}")
1578 |   do echo "start inserting $da file $i"
1579 |     time zcat /da?_data/basemaps/gz/c2chFullU$i.s | ~/lookup/chImportCmt.perl | clickhouse-client --max_partitions_per_insert_block=1000 --host=$da --query 'INSERT INTO commit_u FORMAT RowBinary'
1580 |   done
1581 | done 
1582 | ```
1583 | 
1584 | Once the data is in there we can query commits
1585 | ```bash
1586 | clickhouse-client --host=da3 --query 'select count (*) from commits_all'
1587 | 2061780191
1588 | ```
1589 | or APIs:
1590 | ```bash
1591 | echo "select api,ncmt, nauth, nproj from api_all where match(api, 'stdio') and nauth > 100 limit 3 FORMAT CSV" |clickhouse-client --host=da3 --format_csv_delimiter=";"
1592 | "C:stdio_ext.h";153898;4797;1671
1593 | "C:ustdio.h";9995;1107;230
1594 | "C:vcl_cstdio.h";5868;163;9
1595 | ```
1596 | 
1597 | 	
1598 | It works fast if we specify specific time or an interval:
1599 | ```bash
1600 | clickhouse-client --host=da3 --query 'select author,comment from commits_all where time=1568656268'
1601 | Matt Davis <mw.davis@hotmail.co.uk>     Made some SEO improvements and also added comments outlining what is contained in each section.\n
1602 | Jessie 1307 <295101171@qq.com>  First Commit\n
1603 | �
1604 |  �� <910063@gmail.com>  0917\n
1605 | nodemcu-custom-build <vladurash@yahoo.com>      Prepare my build.config for custom build
1606 | zzzz1313 <zaki56@rambler.ru>    Initial commit
1607 | Erik Faye-Lund <erik.faye-lund@collabora.com>   .mailmap: add an alias for Sergii Romantsov\n
1608 | Paulus Pärssinen <paulus.parssinen01@edupori.fi>       Initial commit
1609 | AnnaLub <yaskrava@gmail.com>    get all tickets command impl\n
1610 | ```
1611 | 
1612 | We may want to match on commit comment
1613 | ```bash
1614 | echo "select lower(hex(sha1)),author, project, comment from commit_all where match(comment, 'CVE-2021') limit 3 FORMAT CSV" |clickhouse-client --host=da3 --format_csv_delimiter=";"
1615 | "Florian Westphal <fw@strlen.de>";"Jackeagle_kernel_msm-3.18";"netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore\nxt_replace_table relies on table replacement counter retrieval (which__NEWLINE__uses xt_recseq to synchronize pcpu counters).\nThis is fine, however with large rule 
1616 | set get_counters() can take__NEWLINE__a very long time -- it needs to synchronize all counters because__NEWLINE__it has to assume concurrent modifications can occur.\nMake xt_replace_table synchronize by itself by waiting until all cpus__NEWLINE__had an even seqcount.\nThis allows a followup patch to copy the cou
1617 | nters of the old ruleset__NEWLINE__without any synchonization after xt_replace_table has completed.\nCc: Dan Williams <dcbw@redhat.com>__NEWLINE__Reviewed-by: Eric Dumazet <edumazet@google.com>__NEWLINE__Signed-off-by: Florian Westphal <fw@strlen.de>__NEWLINE__Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org
1618 | >\n(cherry picked from commit 80055dab5de0c8677bc148c4717ddfc753a9148e)__NEWLINE__Orabug: 32709122__NEWLINE__CVE: CVE-2021-29650__NEWLINE__Signed-off-by: Sherry Yang <sherry.yang@oracle.com>__NEWLINE__Reviewed-by: John Donnelly <john.p.donnelly@oracle.com>__NEWLINE__Signed-off-by: Somasundaram Krishnasamy <somasu
1619 | ndaram.krishnasamy@oracle.com>"
1620 | "Joe Yu <joe.yu@unisoc.com>";"daedroza_aosp_development_sony8960_n";"Fix storaged memory leak\nCVE-2021-0330 : (AOSP) EoP Vulnerability in Framework / storaged__NEWLINE__A-170732441__NEWLINE__Mot-CRs-fixed: (CR)\nstoraged try to load user's proto even if it has been loaded before\nhttps://partnerissuetracker.corp
1621 | .google.com/u/2/issues/118719575\nChange-Id: Ia7575cdc60e82b028c6db9a29ae80e31e02268b1__NEWLINE__(cherry picked from commit 857a63eb6604baa1ed6b0e31839ccce8da18c716)__NEWLINE__Signed-off-by: Mark Salyzyn <salyzyn@google.com>__NEWLINE__Bug: 170732441__NEWLINE__Test: compile__NEWLINE__(cherry picked from commit 8ec
1622 | 2afb91400818b0a8843b8917c05aba75b00db)__NEWLINE__Reviewed-on: https://gerrit.mot.com/1843719__NEWLINE__SLTApproved: Slta Waiver__NEWLINE__SME-Granted: SME Approvals Granted__NEWLINE__Tested-by: Jira Key__NEWLINE__Reviewed-by: Konstantin Makariev <kmakariev@motorola.com>__NEWLINE__Submit-Approved: Jira Key"
1623 | "Joe Yu <joe.yu@unisoc.com>";"daedroza_aosp_development_sony8960_n";"Fix storaged memory leak\nCVE-2021-0330 : (AOSP) EoP Vulnerability in Framework / storaged__NEWLINE__A-170732441__NEWLINE__Mot-CRs-fixed: (CR)\nstoraged try to load user's proto even if it has been loaded before\nhttps://partnerissuetracker.corp
1624 | .google.com/u/2/issues/118719575\nChange-Id: Ia7575cdc60e82b028c6db9a29ae80e31e02268b1__NEWLINE__(cherry picked from commit 857a63eb6604baa1ed6b0e31839ccce8da18c716)__NEWLINE__Signed-off-by: Mark Salyzyn <salyzyn@google.com>__NEWLINE__Bug: 170732441__NEWLINE__Test: compile__NEWLINE__(cherry picked from commit 8ec
1625 | 2afb91400818b0a8843b8917c05aba75b00db)__NEWLINE__Reviewed-on: https://gerrit.mot.com/1844255__NEWLINE__SLTApproved: Slta Waiver__NEWLINE__SME-Granted: SME Approvals Granted__NEWLINE__Tested-by: Jira Key__NEWLINE__Reviewed-by: Konstantin Makariev <kmakariev@motorola.com>__NEWLINE__Submit-Approved: Jira Key"
1626 | ```
1627 | commit sha1's are binary, so to print them we need to process, e.g., 
1628 | ```bash
1629 | clickhouse-client --host=da1 --query 'select sha1, author,comment from commits_all where time=1568656268 limit 1 format RowBinary' | perl -ane '$sha1=substr($_, 0, 20); $o=unpack "H*", $sha1; $rest=substr($_,21,length($_)-21); print "$o;$rest\n";' 
1630 | fbb7add2a58b733a797d97a1e63cb8661702d0a3;zzzz1313 <zaki56@rambler.ru>Initial commit
1631 | ```
1632 | Alternatively, we can hex them in the select statement:
1633 | ```bash
1634 | clickhouse-client --host=da1 --query "select lower(hex(sha1)),author,comment from commits_all where match(comment, '^(CVE-(1999|2\d{3})-(0\d{2}[0-9]|[1-9]\d{3,}))$') limit 2 format CSV" 
1635 | "024fbd8de50c1269d178c3ee6b8664f5eee7f57b","nickmx1896 <nickmx1896@Hotmail.com>","CVE-2016-2355"
1636 | "209446bab86e996d58c233abee0376cb26dcd4c4","jonathanliem94 <jonathanliem94@gmail.com>","CVE-2017-4963"
1637 | ```
1638 | 
1639 | We can create additional tables, so that the time filtering could be fast, for example for projects:
1640 | 
1641 | ```
1642 | clickhouse-client --max_partitions_per_insert_block=1000 --host=da3 --query "CREATE TABLE c2p_$i (date Date, sha1 FixedString(20), np UInt32, p String) ENGINE = MergeTree(date, sha1, 8192)"
1643 | for j in {3..31..4}; do time ./importc2p.perl $j | clickhouse-client --max_partitions_per_insert_block=1000 --host=da3 --query "INSERT INTO c2p_$j (date, sha1, np, p) FORMAT RowBinary"; done 
1644 | ```
1645 | 
1646 | 	
1647 | 	
1648 | ## Python Clickhouse API
1649 | 
1650 | CH API is disabled in the current version of oscar:   make a
1651 | separate module - see draft in lookup/oscarch.py)
1652 | 
1653 | 
1654 | There are classes in oscar.py that allow for querying the clickhouse database:  
1655 | 1. `Time_commit_info(tb_name='commits_all', db_host='localhost')` - commits
1656 | 	* `.commit_counts(start, end=None)` - get the count of the commits given a time interval
1657 | 	* `.commits_iter(start, end=None)` - get the commits as 'Commit' objects in a generator
1658 | 	* `.commits_shas(start, end=None)` - get the sha1 of the commits in a list
1659 | 	* `.commits_shas_iter(start, end=None)` - get the sha1 of the commits in a generator
1660 | 2. `Time_projects_info(tb_name='b2cPtaPkgR_all', db_host='localhsot')` - \*projects
1661 | 	* `.get_values_iter(cols, start, end)` - query columns given a time interval (generator)
1662 | 	* `.project_timeline(cols, repo)` - query columns of a given project name, sorted by time (generator)
1663 | 	* `.author_timeline(cols, author)` - query columns of a given author, sorted by time (generator)
1664 | 	
1665 | \*note that the *b2cPtaPkgR_all* table currently does not contain projects that uses the following programming languages: php, Lisp, Sql, Fml, Swift, Lua, Cob, Erlang, Clojure, Markdown, CSS
1666 | 
1667 | The structures of the databases are listed below:  
1668 | **commits_all:**  
1669 |         |__name___|______type_______|
1670 |         | sha1    | FixedString(20) |
1671 |         | time    | Int32           |
1672 |         | timeCmt | Int32           |
1673 |         | tree    | FixedString(20) |
1674 |         | parent  | String          |
1675 |         | TZAuth  | String          |
1676 |         | TZCmt   | String          |
1677 |         | author  | String          |
1678 |         | commiter| String          |
1679 |         | project | String          |
1680 |         | comment | String          |
1681 | 
1682 | **b2cPtaPkgR_all:**  
1683 | | name     | type            |
1684 | |----------|-----------------|
1685 | | blob     | FixedString(20) |
1686 | | commit   | FixedString(20) |
1687 | | project  | String          |
1688 | | time     | UInt32          |
1689 | | author   | String          |
1690 | | language | String          |
1691 | | deps     | String          |
1692 | 
1693 | We can use the `commit_counts` to query the count of the commits given a time interval:
1694 | ```python
1695 | >>> from oscarch import Time_commit_info
1696 | >>>
1697 | >>> t = Time_commit_info()
1698 | >>> t.commit_counts(1568656268)
1699 | 9
1700 | ```
1701 | The `commits` method can be used to iterate through commit objects given a time interval:
1702 | ```python
1703 | >>> t = Time_commit_info()
1704 | >>> commits = t.commits(1568656268)
1705 | >>> c = commits.next()
1706 | >>> type(c)
1707 | <class 'oscar.Commit'>
1708 | >>> c.parent_shas
1709 | ('9c4cc4f6f8040ed98388c7dedeb683469f7210f5',)
1710 | ```
1711 | The `commits_shas` method can be used to iterate through commit hashes given a time interval:
1712 | ```python
1713 | >>> t = Time_commit_info()
1714 | >>> for sha1 in t.commits_shas(1568656268):
1715 | ...     print(sha1)
1716 | 0a8b6216a42e84d7d1e56661f63e5205d4680854
1717 | 874d92e732d79d0d8bafb1d1bcc76a3b6d81302f
1718 | ccf1a5847661de2df791a5a962e3499a478f48ab
1719 | 39927c70a99f0949c1de3d65a2693c8768bc4e0f
1720 | fbb7add2a58b733a797d97a1e63cb8661702d0a3
1721 | ...
1722 | ```
1723 | 
1724 | The `b2cPtaPkgR_all` table stores the information associated to each **commit**.  
1725 | For `b2cPtaPkgR_all` table, use `get_values_iter` of the `Time_projects_info` class to queries for columns in a given time interval:
1726 | ```python
1727 | >>> from oscar import Time_project_info as Proj
1728 | >>> p = Proj()
1729 | >>> rows = p.get_values_iter(['time','project'], 1568571909, 1568571910)
1730 | >>> for row in rows:
1731 | ...     print(row)
1732 | ...
1733 | (1568571909, 'mrtrevanderson_CECS_424')
1734 | (1568571909, 'gitlab.com_surajpatel_tic_toc_toe')
1735 | (1568571909, 'gitlab.com_surajpatel_tic_toc_toe')
1736 | ...
1737 | ```
1738 | `project_timeline` can be used to query for a specific repository. The result shows the time of the commit and the name of the commit repo sorted by time:
1739 | ```python
1740 | >>> rows = p.project_timeline(['time','project'], 'mrtrevanderson_CECS_424')
1741 | >>> for row in rows:
1742 | ...     print(row)
1743 | ...
1744 | (1568571909, 'mrtrevanderson_CECS_424')
1745 | (1568571909, 'mrtrevanderson_CECS_424')
1746 | (1568571909, 'mrtrevanderson_CECS_424')
1747 | ...
1748 | ```
1749 | 
1750 | It might be useful to examine the dependencies (i.e. includes in C or imports in Python) for each commit.  
1751 | The snippet below shows the time, repo name, language, and dependencies for each commit. Note that the commits are sorted by time and the dependencies are separated by semicolon.
1752 | ```python
1753 | >>> rows = p.get_values_iter(['time', 'project', 'language', 'deps'], 1568571915, 1568571916)
1754 | >>> for row in rows:
1755 | ...     print(row)
1756 | ...
1757 | (1568571916, 'Nakwendaa_neural-network', 'PY', 'numpy\n')
1758 | (1568571916, 'Nakwendaa_neural-network', 'PY', 'os;pickle;numpy;time;matplotlib.pyplot;gzip;Mlp.Mlp\n')
1759 | ```  
1760 | Similarily, `author_timeline` queries for a specific author:
1761 | ```python
1762 | >>> rows = p.author_timeline(['time', 'project'], 'Andrew Gacek <andrew.gacek@gmail.com>')
1763 | >>> for row in rows:
1764 | ...     print(row)
1765 | ...
1766 | (49, 'smaccm_camera_demo')
1767 | (677, 'smaccm_vm_hack')
1768 | (1180017188, 'teyjus_teyjus')
1769 | ...
1770 | ```
1771 | 
1772 | # Considerations on performance
1773 | 
1774 | 1. getValues and showCnt are not supposed to be called every second,
1775 |    the keys are passed through from standard input and one line is
1776 |    generated for each key (get values also passes attributes,
1777 | ```
1778 | echo "k;a" | getValues k2v
1779 | ```
1780 | produces one line
1781 | ```
1782 | k;a;v0;v1;..vn
1783 | ```
1784 | For blobs, it is possible to export the entire content as a single
1785 | bas64 encoded line
1786 | 
1787 | 2. Operations that require iteration over all keys or values (e.g., match a pattern) over all
1788 | is faster via flat files
1789 | ```
1790 | for i in {0..127}; do zcat /da?_data/basemaps/gz/k2vFullUi.s; done | grep PATTERN
1791 | ```
1792 | If the iteration is over commit content, use 
1793 | ```
1794 | cd /da5_data/All.blobs/
1795 | for i in {0..127}; do perl ~/lookup/lstCmt.perl 9 $i; done
1796 | ```
1797 | If iteration is over blob content 
1798 | 
1799 | ```
1800 | cd /da5_data/All.blobs/
1801 | for i in {0..127}; do perl ~/lookup/lstBlob.perl $i; done
1802 | ```
1803 | 
1804 | 3. For a very large number of exact keys (over 500K) it is faster to use unix join 
1805 | (simply split (via splitSec.perl for hashes and splitSecCh.perl for strings), sort each piece
1806 | and use unix join:
1807 | ```
1808 | for i in {0..127}; do zcat /da?_data/basemaps/gz/k2vFullUi.s |join -t\; <(zcat piece$i) -; done
1809 | ```  
1810 | 


--------------------------------------------------------------------------------
/ShellGuide.md:
--------------------------------------------------------------------------------
  1 | Hello and welcome to World of Code!
  2 | 
  3 | World of Code can be navigated using the Linux shell.
  4 | Let’s go over some of the most commonly used Linux shell commands together!
  5 | 
  6 | As a quick note this guide assumes you have used secure shell to connect to da0 by typing “ssh da0” on the appropriate command line as previously
  7 | stated in this World of Code tutorial and have done nothing else.
  8 | 
  9 | Once you are here you will be greeted by a prompt similar to this: “[wparham1@da0]~%” except instead of wparham1 you will see the username
 10 | registered to you during World of Code signup!
 11 | 
 12 | Let’s start learning how to navigate World of Code by using the shell.  If you are new to using a Linux terminal, I encourage you to follow along or follow
 13 | this link "https://www.hostinger.com/tutorials/linux-commands" to further familiarize yourself with shell commands.
 14 | 
 15 | *Note: In this tutorial I will type all commands in quotes but you should not include these quotes when operating with the shell unless specified otherwise.
 16 | 
 17 | Now that we are on the same page, let’s begin by typing the command “ls” without the quotes and pressing enter.
 18 |   •	The “ls” command stands for “list” and it will list all contents of a directory.
 19 |     o	The command can be further specified by using flags after the command.  For instance, let’s type the long flag denoted by “ls -l” next.
 20 |       This will give us a more verbose output telling us the privileges on the file in the first column, then the number of links to the file in 
 21 |       the second column, the owner of the file in the third column, then the group of people to who own the file in the fourth column, 
 22 |       the size of the file in bytes in the fifth column, the date the file was last changed in that directory in the sixth, seventh, and eight column, and 
 23 |       finally the file name in the ninth column.
 24 |       
 25 |     o	Now, let’s try typing the “all” flag denoted by “ls -a”. This will show all contents of a directory including files Linux would 
 26 |       normally hide from the user such as “.” And “..”.
 27 |       
 28 |     o	Now, let’s combine both the “ls -l” flag and the “ls -a” flag. This will result in the verbose, tabular listing of every file inside a directory. 
 29 |       This is meant to demonstrate that you can use any valid combination of flags together with any given Linux command! 
 30 |       
 31 | Now that you have closely examined a most likely blank directory lets go ahead and talk about creating files and directories from the command line.
 32 |   •	The first and arguably most useful command when creating files will be “mkdir [directory name]”. 
 33 |     Use this command for yourself by replacing the [directory name] with whatever you would like to name this directory. 
 34 |     For this example, I recommend naming it “shell_tutorial”. That means the command would look like this: “mkdir shell_tutorial”. 
 35 |     To confirm that you created the directory use the “ls” command we talked about above. Type “ls” into the terminal and press enter to 
 36 |     see your new directory, good job!
 37 |     
 38 |   •	Another common way to create files is by using the Vim text editor. The Vim text editor, while extremely handy, can be a little tricky to get used to.
 39 |     First, type “vim [file name]” into the terminal and press enter.  Again, feel free to replace [file name] with whatever you want 
 40 |     but I recommend replacing it with “vim_tutorial.txt”.  In that case, the command would look like this: “vim vim_tutorial.txt”.  
 41 |     This will open up a new window which will be the file you are editing! To edit the file first press the “i” key. 
 42 |     This tells vim to swap to text editor mode instead of edit mode.  Now type “This is a blank vim file.” and press the escape key on your keyboard.  
 43 |     The escape key tells Vim to take you out of text editor mode and puts you in edit mode.  Next, type “:wq”.  In vim, the “w” stands for save work and 
 44 |     the q stands for quit.  If you ever desire to quit without saving then type the command “:q!”.
 45 |     
 46 |   •	Another extremely useful and extremely common way to create a file is to copy it from an existing file!  
 47 |     This can be done by using the “cp” or “copy command”.  For instance, if we wanted to make a copy of “vim_tutorial.txt” named “vim_tutorial_copy.txt” all 
 48 |     we would have to do is type “cp vim_tutorial.txt vim_tutorial_copy.txt” and make sure that the file we want to copy is in the same directory we are in.
 49 |     You can create copies of a file from a different directory by specifying the path to the directory you wish to copy the file from. For instance, if 
 50 |     I am inside the directory “shell_tutorial” and I want to copy a file named “cool_file.txt” but it is one directory higher than “shell_tutorial” all 
 51 |     I would have to do is type: “cp ../cool_file.txt cool_file_copy.txt”
 52 |     
 53 |     o	*Note: You should be careful when creating a copy of a file as it copies the entire contents of a file so if you have very big files, you will be 
 54 |       making very large copies.
 55 |   •	The final command I would like to talk about here is the “mv” command which stands for “move command”.  
 56 |     This command lets you move a file from one location to another location or even rename a file!  For instance, if I wanted to move “vim_tutorial.txt” to 
 57 |     the directory above “shell_tutorial” I could navigate inside of “shell_tutorial” using the “cd” command then type the following command:       
 58 |     “mv vim_tutorial.txt ..”.  In this command the first parameter specifies the file in question and the second parameter specifies the location in question.
 59 |     If I wanted to move “vim_tutorial.txt” back inside of “shell_tutorial” all I would have to do is type:  “mv ../vim_tutorial.txt .”.  
 60 |     In this case, the first argument is the path to the file I wish to move and the second argument is the “.” (dot) specifying I want the file moved to 
 61 |     this directory.
 62 |     
 63 |     o	The final note I want to make on the “mv” command is its ability to rename files.  In order to do this all you need to do is type: 
 64 |       “mv [name of file to be changed] [new name]”.  If I wanted to change the name of “vim_tutorial.txt” to “awesome_vim.txt” I would type:
 65 |       mv “vim_tutorial.txt” “awesome_vim.txt”.
 66 |       
 67 | Now that we have created a file a few different ways, we can admire our hard work!  To do this let’s use the “cat”, “head”, and “tail” commands!
 68 |   •	First, let’s use “cat”.  This command stands for concatenate and will print the contents of a file to “standard out” (file descriptor 1) which is a fancy
 69 |     way to say it will print to terminal.  To use cat on the file we created type the following command: “cat vim_tutorial.txt”.  
 70 |     This will display the contents of the file “vim_tutorial.txt”.
 71 |     
 72 |     o	*Note: You should be careful when using “cat” as it can quickly flood your terminal with output if you cat a file that is too large.
 73 |     
 74 |   •	Second, let’s use “head” to display the content of our “vim_tutorial.txt” file.  “Head” stands for the “head” of a file.  
 75 |     It will print the first 10 lines of a file to standard out (stdout).  In order to use it on the file we just created type: “head vim_tutorial.txt”.
 76 |     
 77 |     o	Head also has flags associated with it the same way “ls” has the “-l” and “-a” flags.  Head lets the user specify how many lines from the top of the 
 78 |       file the user would like to see.  For instance, if we type “head -5 vim_tutorial.txt” only the first five lines of “vim_tutorial.txt” will be shown!  
 79 |       This is especially useful if you have a file so large that opening the file and looking at the first N arbitrary number of lines will take a long time
 80 |       as can be the case in World of Code.  The best part is that you can also specify a number of lines greater than 10 for head to print.  
 81 |       For instance, “head -1000 vim_tutorial.txt” would print the first one thousand lines of “vim_tutorial.txt” if it had that many lines!
 82 |       
 83 |   •	Finally, let’s use “tail” to display the content of our “vim_tutorial.txt” file.  “Tail” works like the opposite of head and stands for tail of file.
 84 |     It will try to print the final 10 lines of any given file.  In order to print “vim_tutorial.txt” go ahead and enter “tail vim_tutorial.txt”.
 85 |     
 86 |     o	Tail also has the ability to specify how many lines of output you desire from thee end of your file.  To do this use the -x flag where x is a 
 87 |       given number.  For instance, typing “tail -5 vim_tutorial.txt” would print the final five lines of “vim_tutorial.txt”.
 88 | 
 89 | Now that we know how to create files and directories let’s learn how to traverse these directories!  
 90 | The simplest way to traverse directories is by moving through them one directory at a time.  
 91 | To do this we can make use of the “cd” command commonly referred to as the “Change Directory” command.
 92 | 
 93 |   •	First, let’s navigate into the directory we created earlier!  In order to do this, we need to type “cd [name of existing directory]”.  
 94 |     If you named the directory “shell_tutorial” then the command would look like this: “cd shell_tutorial”.  
 95 |     Once you are in this directory you can once again make use to “ls” to see what is in it.  In this case I recommend using
 96 |     the “-a” flag (“ls -a”) in order to see the “.” And “..” folder.  These folders hold valuable information for traversing directories.  
 97 |     The singular dot “.” Means “this directory” while the double dots mean “previous directory”.
 98 |     
 99 |   •	Next, let’s navigate back to the directory we started in.  We can do this by typing   “cd ..”.  
100 |     This command will move us back up one directory.
101 |   
102 |   •	Next let’s understand how the tilde (~) character works.  In Linux the tilde character will try to fill in relevant path information if it has not 
103 |     been specified.  This can be used in a variety of ways including fast directory traversal.  I will cover it more as it becomes relevant.
104 |     
105 |   •	*Note: You can use the “cd ..” command from your home directory to move up a level and see everyone on that da server.  
106 |     For instance, I am usually on da0.  If you wanted to move to my directory you could simply ssh into da0 and type “cd ../wparham1/” to see my home 
107 |     directory and its contents.  If you wanted to go back to your directory from this location all you would need to do is type “cd ../[your username]/”
108 | 
109 | Now we can talk about deleting files and directories.  To do this we will use two commands: rmdir and rm.
110 |   •	Starting with the rmdir command first, it can be used to remove empty directories.  To test this let’s go ahead and create a directory named 
111 |     “short_lived_dir”.  To do this type “mkdir short_lived_dir”.  To make sure that the directory is there you can type “ls”.  
112 |      Next, in order to delete the empty directory, we will type “rmdir_short_lived_dir”.  Upon checking using “ls” we can see that the directory 
113 |      “short_lived_dir” no longer exists!
114 |      
115 |   •	Next, lets look at using the rm command.  This command can be used to both delete directories and files but let’s look at how to delete files first.
116 |     o	To delete a file simply type “rm [filename]”.  However, you need to be careful when you do this because there is no undo.  
117 |       Deleting anything in the shell is usually a permanent action.  For demonstration purposes let’s go ahead and create a copy of “vim_tutorial.txt” 
118 |       named “short_lived_tutorial.txt”.  To do this go to the directory with “vim_tutorial.txt” and type: “cp vim_tutorial.txt short_lived_tutorial.txt”.  
119 |       Next, we can use “ls” to check if our copy exists.  After confirming its existence, lets delete it.  
120 |       To do this all you will need to enter is “rm short_lived_tutorial.txt”.  This will permanently delete the copy.
121 |       
122 |     o	To delete a directory with contents we will need to use the rm command with the -r flag.  This specifies a recursive directory traversal to 
123 |     delete all content inside the directory.  To illustrate this let’s make a copy of the “shell_tutorial” directory.  
124 |     We can do this by typing “cp -r shell_tutorial short_lived_tutorial”.  The -r flag is specifying a recursive copy allowing us to copy the entire 
125 |     directory and its contents!  Now, to delete this directory we need to type: “rm -r short_lived_tutorial”.  
126 |     This will immediately delete the directory specified and no longer grant access to any files it contained.
127 |     
128 | 	  *Note:  You should ALWAYS double check that you are deleting the correct directory when you specify the -r flag.  If you specify the wrong directory you could delete your entire home directory losing all work that isn’t saved elsewhere on a different machine, GitHub repo, etc.  For this reason, you should never, under any circumstance, use the “.” or “..” folders when specifying a directory to delete.  It is much safer to navigate to the directory you want to delete and specify the directory name.
129 | 
130 | The next command I would like to talk about is the “grep” command.  The grep command allows a user to search through a file for a given string.
131 |   •	For instance, if you wanted to search for the word “blank” inside of “vim_tutorial.txt” you could type 
132 |     “grep blank vim_tutorial.txt”.  This will print all lines that contain the string “blank” to standard out.  
133 |     This command is extremely useful for searching and filtering when using World of Code particularly in x2y type mappings.
134 |     
135 |   •	You can also combine grep with regex expressions to expand the range of what you could possibly grep.  
136 |     A great example of this is: grep -iE ';code\W?(of)?\W?conduct'.  In this the -i flag specifies to ignore the case of the grepped expression and 
137 |     the -E flag allows you to grep with a regex expression.
138 |     
139 | The next useful general purpose command is “wc” which is short for “word count”.  
140 | Word count allows you to see how many lines, words, and bytes are in a file.  This, much like head and tail, is particularly useful for the
141 | potentially huge files you can encounter when using World of Code (8 gb of project names anyone? I digress).
142 |   •	Thankfully, “wc” is an extremely intuitive command to use.  If we call “wc” without any flags it will return to us the number of lines, words, and bytes 
143 |     in a file in that order.  For instance, “wc vim_tutorial.txt” will result in the following output: “1  5  26  vim_tutorial.txt”.
144 | 
145 | •	We can also specify whether we want the number of lines using “-l”, the number of words using “-w”, and the number of characters using “-c”.  
146 |   For instance, “wc -l vim_tutorial.txt” will result in the following output: “1  vim_tutorial.txt”.
147 | 
148 | Also worth mentioning is the “clear” command.  Many times, you will accidentally flood your terminal with output by catting a file that was too large or 
149 | running a command to stdout that should have been redirected to a file.  When you do this it can be helpful to clear that output and get a clean terminal 
150 | screen so you can better keep track of where you are in a directory, the last command you executed, etc.  
151 | To do this all you need to do is type “clear” into the terminal and press enter.  This will give you a completely clean terminal to work in.
152 | 
153 | The next thing we will look at is simple output redirection.  With this anything that is printed to the terminal or stdout can be redirected to a file.  
154 | This can be extremely useful if you don’t want to do formatted write to file calls in a programming language and instead would rather just write to standard
155 | out and redirect it to a file!
156 | 
157 |   •	To understand this let’s start with an example.  Since we know how to cat a file to print its contents to standard out we can start there.  
158 |     First, navigate to the file holding “vim_tutorial.txt” and then cat it by entering “cat vim_tutorial.txt”.  
159 |     Once you have done that take it a step further and enter “cat vim_tutorial.txt > my_new_file.txt”.  
160 |     Upon executing this command, you may notice that there is no output to the terminal.  This is because we have redirected 
161 |     the output to the file specified after the “>”.  The file after the “>” will always be created or overwritten depending on the file’s 
162 |     previous existence.  This can be important because it means you can very easily overwrite a file you have already created if you redirect into
163 |     a file with the same name.  Watch out for this.
164 |     
165 | Now we will cover a few complicated but equally useful examples that pertain to World of Code specifically.
166 |   •	The first thing we will cover is how to retrieve a list of all projects and deforked projects from version U of World of Code.
167 |     o	This can be done by entering the following command on the command line: “zcat /da?_data/basemaps/gz/p2PU.s ”.  
168 |       This will create a comprehensive list of all projects and map to their deforked counter parts separated by a “;”.  
169 |       This makes it convenient to tokenize each line using your preferred programming language to look at each individual piece.  
170 |       This also makes it convenient to use the “cut” command in linux to only grab the portion of the line you are interested in.  
171 |       As a quick note this does not guarantee that each project is unique.  If you need a list of unique project you should pipe either your cut version 
172 |       or the full version into the appropriate “uniq” Linux filter.
173 |       
174 |   *Note: An example of this will be added at a later date 
175 | 
176 |   •	The next thing we will cover is how to query and export a list from the MongoDB database contained in World of Code.  
177 |   
178 |     o	First, enter MongoDB.  If you are on da1 you simply type “mongo”.  If you are on a different server you can type “mongo –host “da1.eecs.utk.edu”.  
179 |       Next, specify that we want to use World of Code by entering “use WoC”.  Now we can either interpret from A_metadata or P_metadata.  
180 |       A_metadata stands for author meta data and P stands for project meta data.  In this example we will be using Project meta data or “P_metadata”.  
181 |       To understand how to create this mongo export we can look at the following command: 
182 |     	"mongoexport -h da5 -d WoC -c P_metadata.U -f ProjectID,NumActiveMon,NumAuthors,NumCommits,Gender -o dump_with_gender.csv --type=csv"
183 |       
184 |     o	While this command is very long it is thankfully quite easy to understand once it is broken down.  
185 |       First, mongoexport tells the MongoDB database that we want to export a list from it.  The “-h” specifies that we want the host to be da5.  
186 |       The “-d” specifies the WoC database.  The “-c” specifies the collection as P_metadata.U, and the -f specifies the fields we desire.  
187 |       After the -f you should enter the fields you would like to be exported in that order without spaces.  
188 |       The “-o” specifies a file to write to, in this case it is “dump_with_gender.csv”.  
189 |       Finally, the “--type=csv” specifies that we want this file in a csv or comma separated value format.  
190 |       This is the same way Excel files are formatted for those with Excel experience!
191 | 
192 |     o	A few important things to note about how mongo generates these files.  If mongo does not have all the information necessary to populate each field it 
193 |       will fill the field in with ‘’.  This is important if you want to tokenize on commas and look at each value.  
194 |       Also, when you specify gender sometimes Mongo will give you the number of females, males, or both.  It tries to give you all the information it
195 |       has but the amount of information can be inconsistent.  It is good to be prepared for this issue.  
196 |       Lastly, when you query Mongo in this way it will give you a result for every project it has meaning it will return a very large file.  
197 |       Be prepared to wait a few minutes while it is generating this file and be prepared to interpret this file.  
198 |       In many scenarios you cannot view it as just another Excel file because it will be too large for the Excel grid.  
199 |       I have also had it crash my Visual Studios when trying to view it even after assigning an appropriate amount of memory.
200 | 
201 |   •	Finally, I would like to look at how we can fetch files off of World of Code without using a third party like a GitHub repoitory especially if the
202 |     file is too big for GitHub and GitHub’s Large File Storage (LFS).
203 |     
204 |     o	To do this we can use the “scp” command from the terminal on our local pc.  This means that we do not run the following command on the WoC servers,
205 |       we run it on our local machine. This is an example of me using scp on my command line: "scp wparham1@da0:~wparham1/dump_with_gender.csv ~/Downloads/"
206 |       
207 |     o	The “scp” command stands for “secure copy”.  In order to use it you will need ssh permission to the location you wish to copy from, in this case World
208 |     of Code.  To use this command first specify your username then @ the server you wish to connect to.  
209 |     The second parameter is the path of the file you wish to copy.  In my case this would be “~wparham1/dump_with_gender.csv”.  
210 |     The second parameter is the location on your local computer you wish to copy this file to.  In my case this would be “~/Downloads/”.  The scp command in a
211 |     generalized form would be as follows: scp [WoC username]@[WoC server you ssh into] [path to file you want to download] [path to location you want to download to].
212 | 


--------------------------------------------------------------------------------
/wochardware.md:
--------------------------------------------------------------------------------
 1 | # WoC Hardware
 2 | 
 3 | |hostame|CPU|RAM|HDD|SSD|
 4 | |--|--|--|--|--|
 5 | |da[0-2]| Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz - 24 cores | 396GB RAM | 35T HDD||
 6 | |da3| Intel(R) Xeon(R) CPU E5-2623 v3 @ 3.00GHz - 8 cores | 396GB RAM | 70T HDD |15TBSDD|
 7 | |da4|Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz - 16 cores |792GB RAM |90T HDD |15TBSDD|
 8 | |da5|1Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz - 80 cores|1.320TB RAM|124T HDD | 48TB SDD|
 9 | |da6|Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz - 40 cores + 4 Nvidia Tesla V100-SXM2-32GB|256GiB RAM||1.9TSSD|
10 | |bb0[^1]|||305TB HDD||
11 | |bb1[^1]|||471TB HDD||
12 | |da7[^1]|Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz x2= 32 HW threads|384GiB RAM|655TB HDD ZFS ~550TB usable in 45-wide raidz2||
13 | |da8[^1]|Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz x2= 32 HW threads|384GiB RAM|655TB HDD ZFS ~510TB usable in 3x15-wide raidz2||
14 | 
15 | [^1]: These systems are primarily for storage and NOT recommended for running jobs. Access may not be available to all users, and they do not use the same mount points.
16 | 


--------------------------------------------------------------------------------