├── .classpath
├── .project
├── README.md
├── build.xml
└── src
    ├── AndNot.java
    ├── Intersect.java
    ├── Or.java
    └── PositionalWithinK.java


/.classpath:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
7 | 


--------------------------------------------------------------------------------
/.project:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | Merge Algorithms
4 | 
5 | 
6 | 
7 | 
8 | 
9 | org.eclipse.jdt.core.javabuilder
10 | 
11 | 
12 | 
13 | 
14 | 
15 | org.eclipse.jdt.core.javanature
16 | 
17 | 
18 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CS276 merge algorithm exercises
2 | 
3 | Whenever we are performing a query with more than one word, we have to perform
4 | some kind of combination of the results for documents that do or do not
5 | contain each word.
6 | 
7 | This exercise works through some of the cases that you should understand.
8 | 
9 | In the first class, we spent almost all the time talking about Google....
10 | A large search engine has thousands of machines and can keep a ton of stuff
11 | in memory. But we should also consider small search systems, such as Spotlight
12 | search on your Mac or the equivalent Windows Search for Windows.
13 | 
14 | **1.**
15 | Do standard Linux distributions provide a search engine for your computer?
16 | Your Linux geek friend says that it's easy and that if you want to search for
17 | files with "apple" and "computer" in them, all you need to do is type:
18 | ```bash
19 | comm -12 <(grep -Riwl "apple" /) <(grep -Riwl "computer" /)
20 | ```
21 | Is that a good solution?
22 | 
23 | If we want to provide a good text search system on a small machine:
24 | 1. For speed, we need to index.
25 | 2. For a good trade-off on memory use/time efficiency, we probably keep the dictionary in memory
26 |    but we need to keep all the postings lists on disk.
27 | 3. For acceptable speed and memory use when working with postings lists on disk, we need to have an algorithm
28 |    that streams postings from disk and just iterates once forward through the
29 |    postings list. A hard disk does not support efficient random access.
30 |    Algorithms that do this for two or more lists simultaneously are referred to
31 |    as "merge algorithms". The name is maybe misleading, since we sometimes, e.g.,
32 |    intersect rather than merge lists, but the name is traditional.
33 | 4. The secret to such algorithms being efficient is keeping postings lists consistently *sorted*.
34 |    In this exercise, our postings lists are actually built in memory for simplicity,
35 |    but we want to write algorithms that support this model of efficiently streaming postings lists
36 |    from disk.
37 | 
38 | ### Document-level indices
39 | 
40 | **2.**
41 | Let's first write a simple routine to do an "AND" query – we intersect two postings lists.
42 | We've provided in `Intersect.java` (in `src`) a skeleton for some code that loads postings
43 | lists and tries to intersect pairs of them. Postings lists are semicolon-separated lists
44 | of document IDs, which you can pass in on the command line, or you can use our test cases.
45 | To get this code in Eclipse, do `File|Import`, choose `Git|Projects from Git`, press `Next`,
46 | `Clone URI` then `Next`, then enter the HTTPS URI on this page, and `Next`, `Next`, `Next`,
47 | go with `Import existing projects`, `Next` and `Finish` – and you should be all ready to go!
48 | (The code uses the Java 7 diamond operator – if you last used Java in CS106A, then you do
49 | need to update to a more modern version of Eclipse.)
50 | 
51 | Here's one potential solution:
52 | 
53 | ```
54 | static List<Posting> listFromIterator(Iterator<Posting> iter) {
55 |   List<Posting> list = new ArrayList<>();
56 |   Posting p;
57 |   while ((p = popNextOrNull(iter)) != null) {
58 |     list.add(p);
59 |   }
60 |   return list;
61 | }
62 | 
63 | static List<Integer> intersect(Iterator<Posting> p1, Iterator<Posting> p2) {
64 |   List<Integer> answer = new ArrayList<>();
65 | 
66 |   List<Posting> lp1 = listFromIterator(p1);
67 |   List<Posting> lp2 = listFromIterator(p2);
68 |   for (Posting posting1 : lp1) {
69 |     for (Posting posting2 : lp2) {
70 |       if (posting1.docID == posting2.docID) {
71 |         answer.add(posting1.docID);
72 |       }
73 |     }
74 |   }
75 | 
76 |   return answer;
77 | }
78 | ```
79 | 
80 | Is it a good solution in terms of the criteria above? Why or why not?
81 | 
82 | **3.**
83 | Let's now write a solution using a merge algorithm.
84 | It'd be by far the best if you can just write your own merge algorithm for
85 | postings list intersection from first principles!
86 | However, if you can't remember what that's about, you could look at Figure 1.6
87 | of the textbook. Check that your solution works on our test cases.
88 | 
89 | **4.**
90 | Suppose we then wanted to do an "OR" algorithm to more truly "merge" postings lists.
91 | How would you modify your code in `Intersect.java` to do an "OR"?
92 | Try it out in `Or.java`.
93 | 
94 | *~~~ If you have extra time before we move to positional indices, you can also do __7__ and __8__. ~~~*
95 | 
96 | ### Positional indices
97 | 
98 | It's pretty standard these days that an IR system can efficiently answer not only queries for documents that
99 | contain multiple words but also queries requiring that those words occur close together. The simplest case is
100 | phrase queries, where we require the words to be adjacent and in order, like ["machine learning"].
101 | A more complex form is "WITHIN k" queries, which we write "/k". For example,
102 | the query [student /3 drunk] would match a document saying either "drunk student" or
103 | "a student who is drunk" but not "the student said that the faculty member was drunk".
104 | Note that the algorithm should return **all** matches in the document, so that if document 7 is
105 | "a drunk student who is drunk", then the query [drunk /3 student] should return two matches for
106 | this document: (7, 2, 3) and (7, 6, 3).
107 | 
108 | **5.**
109 | We will augment our postings lists with the positions of each token within each document,
110 | numbering them as token 1, token 2, etc. After each document ID, there will now be a colon
111 | and then a comma-separated list of positions. How can we extend our postings list merge
112 | algorithms to work with positional postings lists? Try to work out an algorithm that will do that.
113 | We've provided a skeleton in the file `PositionalWithinK.java`. Try to write something that
114 | passes the test cases provided there. (**Note:** There is some code in Introduction to Information Retrieval
115 | for this algorithm, but we're _really_ wanting you to try to write it by yourself. This education thing is
116 | all about _learning_, right? Pretend that you're practicing for your Google coding interview!)
117 | 
118 | **6.**
119 | Our document-level merge algorithm had some important properties. It worked using a single forward pass
120 | through each postings list, so it was suitable for postings lists streamed from disk, and it could be both time efficient
121 | (linear in the length of the postings lists) and space efficient (the memory required does not depend on the size of the postings
122 | lists, since in the streaming scenario, you can just read and refill as needed a sliding buffer over the postings list,
123 | like standard buffered IO). (As implemented in our sample code, space need only grow with the size of the set of matches.)
124 | Can these properties be maintained for doing a WITHIN k merge? We think they can! If that's not true of your solution,
125 | try to rewrite it so that: (i) The algorithm makes a single always-forward pass through each postings list and (ii)
126 | The memory required does not depend on the size of the input postings list, not even the size of the postings list
127 | for a single document (this is useful – some documents are very long!).
128 | 
129 | 
130 | *~~~ If you have extra time ~~~*
131 | 
132 | **7.**
133 | Most search engines, including Google, support a negation or "NOT" operation.
134 | For instance, search on Google for [space]. (That is, search for the stuff inside the square brackets, the actual
135 | word "space" as in dark and cold.)
136 | For negation, you precede a word with "-". Try searching on Google for [space -astronomy]. See how
137 | the results change. According to Boolean logic, what should happen if you just search for a negation like [-astronomy]? What
138 | does happen? Is there a good reason why things might work the way they do?
139 | 
140 | **8.**
141 | Can we write an efficient merge algorithm for "AND NOT" queries?
142 | Try it out in `AndNot.java`.
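
The single-forward-pass property described in **6** can be seen in miniature on plain sorted arrays of document IDs. What follows is only a minimal sketch in the style of the textbook's intersection algorithm, not the course's reference solution, so try **3** yourself before reading it. The class name `MergeSketch` and the use of `int[]` in place of the skeleton's `Iterator<Posting>` are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch class; it is not part of the exercise skeleton. */
final class MergeSketch {

  /**
   * Intersects two docID arrays that are sorted in ascending order.
   * Each index only moves forward, so every list is read exactly once.
   */
  static List<Integer> intersect(int[] a, int[] b) {
    List<Integer> answer = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {
        // Same document in both lists: record it and advance both.
        answer.add(a[i]);
        i++;
        j++;
      } else if (a[i] < b[j]) {
        // a[i] cannot occur later in b, because b is sorted.
        i++;
      } else {
        // b[j] cannot occur later in a, because a is sorted.
        j++;
      }
    }
    return answer;
  }

  public static void main(String[] args) {
    int[] p1 = {1, 2, 4, 5, 7, 13};
    int[] p2 = {1, 4, 5, 6, 8, 10, 13};
    System.out.println(intersect(p1, p2)); // prints [1, 4, 5, 13]
  }
}
```

Because neither index ever moves backward, the same loop shape should carry over when the arrays are replaced by iterators that stream postings from disk, and the work is linear in the combined length of the two lists.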
143 | -------------------------------------------------------------------------------- /build.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 54 | 55 | 57 | 58 | 59 | 60 | 61 | 62 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 81 | 82 | 83 | 84 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 104 | 105 | 107 | 108 | 109 | 110 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 131 | 132 | 133 | 134 | 135 | 136 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 163 | 164 | 166 | 167 | 168 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | -------------------------------------------------------------------------------- /src/AndNot.java: -------------------------------------------------------------------------------- 1 | import java.util.ArrayList; 2 | import java.util.Iterator; 3 | import java.util.List; 4 | 5 | /** 6 | * @author Christopher Manning 7 | */ 8 | public class AndNot { 9 | 10 | /** If true, loaded postings lists are printed; this just shows that they were loaded correctly. 
*/ 11 | static boolean DEBUG = false; 12 | 13 | /** Test cases */ 14 | static final String[][] mergeTestCases = { 15 | { "1; 2; 4; 5; 7; 13", 16 | "1; 4; 5; 6; 8; 10; 13", 17 | "[2, 7]" }, 18 | { "1; 5", 19 | "1; 5", 20 | "[]" }, 21 | { "1", 22 | "1", 23 | "[]" }, 24 | { "1; 17; 21", 25 | "4; 5; 17; 21; 97; 108", 26 | "[1]" }, 27 | { "5; 11; 12; 14; 15; 103", 28 | "3; 8; 11; 14; 15; 16; 18; 100; 103; 109", 29 | "[5, 12]" }, 30 | { "1; 5; 11; 13; 19; 43", 31 | "2; 3; 5; 9; 11; 15; 19; 33; 45", 32 | "[1, 13, 43]" }, 33 | { "1", 34 | "2; 189", 35 | "[1]" }, 36 | { "3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;" + 37 | "64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;" + 38 | "117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;" + 39 | "197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;" + 40 | "266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;" + 41 | "308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;" + 42 | "356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;" + 43 | "417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;" + 44 | "469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;" + 45 | "535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;" + 46 | "608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;" + 47 | "671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;" + 48 | "730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;" + 49 | "782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;" + 50 | "836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;" + 51 | "906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;" + 52 | 
"952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;" + 53 | "999", // "faculty" in first 1000 documents of Stanford crawl 54 | "324;335;418;466;505;686", // "anthropology" in first 1000 documents of Stanford crawl 55 | "[3, 4, 9, 16, 19, 24, 25, 27, 28, 30, 31, 32, 33, 35, 36, 43, 46, 47, 52, 55, 57, 60, " + 56 | "61, 62, 64, 65, 66, 77, 78, 80, 83, 86, 91, 98, 99, 100, 101, 102, 103, 104, " + 57 | "106, 108, 112, 113, 116, 117, 119, 120, 127, 141, 147, 151, 156, 158, 168, 170, " + 58 | "172, 175, 179, 182, 184, 185, 187, 195, 197, 199, 202, 206, 207, 208, 209, 210, " + 59 | "213, 221, 225, 227, 228, 233, 238, 249, 252, 255, 256, 266, 267, 268, 270, 271, " + 60 | "273, 274, 281, 284, 285, 289, 290, 292, 294, 299, 301, 302, 303, 306, 308, 312, " + 61 | "320, 321, 322, 325, 326, 328, 329, 332, 334, 337, 341, 342, 344, 345, " + 62 | "347, 349, 356, 357, 358, 360, 364, 376, 377, 379, 382, 383, 385, 395, 397, 403, " + 63 | "404, 405, 406, 410, 412, 417, 423, 430, 431, 432, 433, 434, 437, 440, 441, " + 64 | "445, 446, 452, 453, 454, 461, 464, 469, 477, 480, 486, 487, 488, 495, 496, " + 65 | "506, 507, 511, 512, 517, 518, 520, 522, 524, 526, 532, 535, 540, 543, 549, " + 66 | "550, 558, 562, 563, 564, 571, 574, 581, 586, 587, 592, 597, 598, 604, 607, 608, " + 67 | "615, 620, 621, 622, 625, 633, 634, 635, 636, 639, 640, 642, 653, 654, 656, 658, " + 68 | "660, 668, 671, 676, 680, 681, 683, 687, 689, 694, 697, 702, 703, 708, 710, " + 69 | "711, 714, 722, 723, 729, 730, 737, 739, 742, 746, 747, 750, 756, 757, 758, 759, " + 70 | "764, 765, 766, 769, 770, 772, 777, 780, 782, 783, 784, 791, 795, 798, 801, 807, " + 71 | "812, 815, 816, 822, 823, 824, 825, 828, 830, 833, 835, 836, 837, 841, 852, 854, " + 72 | "863, 864, 865, 868, 870, 873, 880, 882, 884, 887, 888, 889, 897, 902, 906, 912, " + 73 | "914, 918, 922, 924, 925, 928, 929, 932, 933, 934, 938, 939, 941, 943, 944, 947, " + 74 | "948, 952, 955, 961, 962, 963, 968, 971, 973, 975, 979, 980, 983, 984, 987, 
989, " +
75 |       "993, 995, 996, 999]" },
76 |   };
77 | 
78 |   /** Stores the Posting for a single document: a docID and optionally a list of document positions. */
79 |   private static class Posting {
80 |     final int docID;
81 |     final List<Integer> positions;
82 |     public Posting(int docID, List<Integer> positions) {
83 |       this.docID = docID;
84 |       this.positions = positions;
85 |     }
86 |     public Iterator<Integer> positions() { return positions.iterator(); }
87 |     public String toString() {
88 |       return docID + ":" + positions;
89 |     }
90 |   }
91 | 
92 |   /** Returns the next item from the Iterator, or null if it is exhausted.
93 |    *  (This is a more C-like method than idiomatic Java, but we use it so as
94 |    *  to be more parallel to the pseudo-code in the textbook.)
95 |    */
96 |   static <X> X popNextOrNull(Iterator<X> p) {
97 |     if (p.hasNext()) {
98 |       return p.next();
99 |     } else {
100 |       return null;
101 |     }
102 |   }
103 | 
104 |   static List<Integer> merge(Iterator<Posting> p1, Iterator<Posting> p2) {
105 |     List<Integer> answer = new ArrayList<>();
106 | 
107 |     Posting pp1 = popNextOrNull(p1);
108 |     Posting pp2 = popNextOrNull(p2);
109 | 
110 | 
111 |     // WRITE ALGORITHM HERE
112 | 
113 |     return answer;
114 |   }
115 | 
116 |   /** Load a single postings list: Information about where a single token
117 |    *  appears in documents in the collection. This can load either a document-
118 |    *  level postings list, which is a list of integer docIDs separated by semicolons,
119 |    *  or a positional postings list, where each docID is followed by a colon
120 |    *  and then a comma-separated list of positions.
121 |    *  @param postingsString A String representation of a postings list
122 |    *  @return An Iterator over a {@code List<Posting>}
123 |    */
124 |   static Iterator<Posting> loadPostingsList(String postingsString) {
125 |     List<Posting> postingsList = new ArrayList<>();
126 |     String[] postingsArray = postingsString.split(";");
127 |     for (String posting : postingsArray) {
128 |       String[] bits = posting.split(":");
129 |       String[] poses = {};
130 |       if (bits.length > 1) {
131 |         poses = bits[1].split(",");
132 |       }
133 |       int docID = Integer.valueOf(bits[0].trim());
134 |       List<Integer> positions = new ArrayList<>();
135 |       for (String pos : poses) {
136 |         positions.add(Integer.valueOf(pos.trim()));
137 |       }
138 |       Posting post = new Posting(docID, positions);
139 |       postingsList.add(post);
140 |     }
141 |     if (DEBUG) {
142 |       System.err.println("Loaded postings list: " + postingsList);
143 |     }
144 |     return postingsList.iterator();
145 |   }
146 | 
147 |   /** Main method. With no parameters, it runs some internal test cases.
148 |    *  With two postings list arguments, it computes the AND NOT (first list minus second list) of the arguments given on the command line.
149 |    *  Otherwise, it will print a usage message.
150 |    *
151 |    *  @param args Command-line arguments, as above.
152 |    */
153 |   public static void main(String[] args) {
154 |     if (args.length == 0) {
155 |       for (String[] test : mergeTestCases) {
156 |         Iterator<Posting> pl1 = loadPostingsList(test[0]);
157 |         Iterator<Posting> pl2 = loadPostingsList(test[1]);
158 |         System.out.println("Merge of " + test[0]);
159 |         System.out.println(" and " + test[1] + ": ");
160 |         List<Integer> ans = merge(pl1, pl2);
161 |         System.out.println("Answer: " + ans);
162 |         if ( ! ans.toString().equals(test[2])) {
163 |           System.out.println("Correct: " + test[2]);
164 |           System.out.println("*** ERROR ***");
165 |         }
166 |         System.out.println();
167 |       }
168 |     } else if (args.length != 2) {
169 |       System.err.println("Usage: java AndNot postingsList1 postingsList2");
170 |       System.err.println(" postingsList format(s): '1:17,25; 4:17,191,291,430,434; 5:14,19,10'");
171 |       System.err.println(" or: '1; 4; 5'");
172 |     } else {
173 |       Iterator<Posting> pl1 = loadPostingsList(args[0]);
174 |       Iterator<Posting> pl2 = loadPostingsList(args[1]);
175 |       List<Integer> ans = merge(pl1, pl2);
176 |       System.out.println(ans);
177 |     }
178 |   }
179 | 
180 | }
181 | 


--------------------------------------------------------------------------------
/src/Intersect.java:
--------------------------------------------------------------------------------
1 | import java.util.ArrayList;
2 | import java.util.Iterator;
3 | import java.util.List;
4 | 
5 | /**
6 |  * @author Christopher Manning
7 |  */
8 | public class Intersect {
9 | 
10 |   /** If true, loaded postings lists are printed; this just shows that they were loaded correctly.
*/ 11 | static boolean DEBUG = false; 12 | 13 | /** Test cases */ 14 | static final String[][] intersectTestCases = { 15 | { "1; 2; 4; 5; 7; 13", 16 | "1; 4; 5; 6; 8; 10; 13", 17 | "[1, 4, 5, 13]" }, 18 | { "1; 5", 19 | "1; 5", 20 | "[1, 5]" }, 21 | { "1:1,2,3,4,5,6,7", 22 | "1:1,2,3,4,5,6,7", 23 | "[1]" }, 24 | { "1:11,92; 17:6,16; 21:103,113,114", 25 | "4:8; 5:2; 17:11; 21:3, 97,108", 26 | "[17, 21]" }, 27 | { "5:4; 11:7,18; 12:1,17; 14:8,16; 15:363,367; 103:28", 28 | "3:2; 8:9; 11:17,25; 14:17,434; 15:101; 16:19; 18:42; 100:11; 103:24; 109:11", 29 | "[11, 14, 15, 103]" }, 30 | { "1:1; 5:1; 11:1; 13:1; 19:1; 43:1", 31 | "2:1; 3:1; 5:1; 9:1; 11:1; 15:1; 19:1; 33:1; 45:1", 32 | "[5, 11, 19]" }, 33 | { "1:1", 34 | "2:1; 189:10", 35 | "[]" }, 36 | { "3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;" + 37 | "64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;" + 38 | "117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;" + 39 | "197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;" + 40 | "266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;" + 41 | "308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;" + 42 | "356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;" + 43 | "417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;" + 44 | "469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;" + 45 | "535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;" + 46 | "608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;" + 47 | "671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;" + 48 | "730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;" + 49 | "782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;" + 50 | 
"836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;" + 51 | "906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;" + 52 | "952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;" + 53 | "999", // "faculty" in first 1000 documents of Stanford crawl 54 | "324;335;418;466;505;686", // "anthropology" in first 1000 documents of Stanford crawl 55 | "[335, 418, 466, 686]" }, 56 | }; 57 | 58 | /** Stores the Posting for a single document: a docID and optionally a list of document positions. */ 59 | private static class Posting { 60 | final int docID; 61 | final List positions; 62 | public Posting(int docID, List positions) { 63 | this.docID = docID; 64 | this.positions = positions; 65 | } 66 | public Iterator positions() { return positions.iterator(); } 67 | public String toString() { 68 | return docID + ":" + positions; 69 | } 70 | } 71 | 72 | /** Returns the next item from the Iterator, or null if it is exhausted. 73 | * (This is a more C-like method than idiomatic Java, but we use it so as 74 | * to be more parallel to the pseudo-code in the textbook.) 75 | */ 76 | static X popNextOrNull(Iterator p) { 77 | if (p.hasNext()) { 78 | return p.next(); 79 | } else { 80 | return null; 81 | } 82 | } 83 | 84 | static List intersect(Iterator p1, Iterator p2) { 85 | List answer = new ArrayList<>(); 86 | 87 | Posting pp1 = popNextOrNull(p1); 88 | Posting pp2 = popNextOrNull(p2); 89 | 90 | // WRITE ALGORITHM HERE 91 | 92 | return answer; 93 | } 94 | 95 | /** Load a single postings list: Information about where a single token 96 | * appears in documents in the collection. 
This can load either a document-
97 |    *  level postings list, which is a list of integer docIDs separated by semicolons,
98 |    *  or a positional postings list, where each docID is followed by a colon
99 |    *  and then a comma-separated list of positions.
100 |    *  @param postingsString A String representation of a postings list
101 |    *  @return An Iterator over a {@code List<Posting>}
102 |    */
103 |   static Iterator<Posting> loadPostingsList(String postingsString) {
104 |     List<Posting> postingsList = new ArrayList<>();
105 |     String[] postingsArray = postingsString.split(";");
106 |     for (String posting : postingsArray) {
107 |       String[] bits = posting.split(":");
108 |       String[] poses = {};
109 |       if (bits.length > 1) {
110 |         poses = bits[1].split(",");
111 |       }
112 |       int docID = Integer.valueOf(bits[0].trim());
113 |       List<Integer> positions = new ArrayList<>();
114 |       for (String pos : poses) {
115 |         positions.add(Integer.valueOf(pos.trim()));
116 |       }
117 |       Posting post = new Posting(docID, positions);
118 |       postingsList.add(post);
119 |     }
120 |     if (DEBUG) {
121 |       System.err.println("Loaded postings list: " + postingsList);
122 |     }
123 |     return postingsList.iterator();
124 |   }
125 | 
126 |   /** Main method. With no parameters, it runs some internal test cases.
127 |    *  With two postings list arguments, it intersects the arguments given on the command line.
128 |    *  Otherwise, it will print a usage message.
129 |    *
130 |    *  @param args Command-line arguments, as above.
131 |    */
132 |   public static void main(String[] args) {
133 |     if (args.length == 0) {
134 |       for (String[] test : intersectTestCases) {
135 |         Iterator<Posting> pl1 = loadPostingsList(test[0]);
136 |         Iterator<Posting> pl2 = loadPostingsList(test[1]);
137 |         System.out.println("Intersection of " + test[0]);
138 |         System.out.println(" and " + test[1] + ": ");
139 |         List<Integer> ans = intersect(pl1, pl2);
140 |         System.out.println("Answer: " + ans);
141 |         if ( ! ans.toString().equals(test[2])) {
142 |           System.out.println("Should be: " + test[2]);
143 |           System.out.println("*** ERROR ***");
144 |         }
145 |         System.out.println();
146 |       }
147 |     } else if (args.length != 2) {
148 |       System.err.println("Usage: java Intersect postingsList1 postingsList2");
149 |       System.err.println(" postingsList format(s): '1:17,25; 4:17,191,291,430,434; 5:14,19,10'");
150 |       System.err.println(" or: '1; 4; 5'");
151 |     } else {
152 |       Iterator<Posting> pl1 = loadPostingsList(args[0]);
153 |       Iterator<Posting> pl2 = loadPostingsList(args[1]);
154 |       List<Integer> ans = intersect(pl1, pl2);
155 |       System.out.println(ans);
156 |     }
157 |   }
158 | 
159 | }
160 | 


--------------------------------------------------------------------------------
/src/Or.java:
--------------------------------------------------------------------------------
1 | import java.util.ArrayList;
2 | import java.util.Iterator;
3 | import java.util.List;
4 | 
5 | /**
6 |  * @author Christopher Manning
7 |  */
8 | public class Or {
9 | 
10 |   /** If true, loaded postings lists are printed; this just shows that they were loaded correctly.
*/ 11 | static boolean DEBUG = false; 12 | 13 | /** Test cases */ 14 | static final String[][] mergeTestCases = { 15 | { "1; 2; 4; 5; 7; 13", 16 | "1; 4; 5; 6; 8; 10; 13", 17 | "[1, 2, 4, 5, 6, 7, 8, 10, 13]" }, 18 | { "1; 5", 19 | "1; 5", 20 | "[1, 5]" }, 21 | { "1", 22 | "1", 23 | "[1]" }, 24 | { "1; 17; 21", 25 | "4; 5; 17; 21; 97; 108", 26 | "[1, 4, 5, 17, 21, 97, 108]" }, 27 | { "5; 11; 12; 14; 15; 103", 28 | "3; 8; 11; 14; 15; 16; 18; 100; 103; 109", 29 | "[3, 5, 8, 11, 12, 14, 15, 16, 18, 100, 103, 109]" }, 30 | { "1; 5; 11; 13; 19; 43", 31 | "2; 3; 5; 9; 11; 15; 19; 33; 45", 32 | "[1, 2, 3, 5, 9, 11, 13, 15, 19, 33, 43, 45]" }, 33 | { "1", 34 | "2; 189", 35 | "[1, 2, 189]" }, 36 | { "3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;" + 37 | "64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;" + 38 | "117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;" + 39 | "197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;" + 40 | "266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;" + 41 | "308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;" + 42 | "356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;" + 43 | "417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;" + 44 | "469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;" + 45 | "535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;" + 46 | "608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;" + 47 | "671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;" + 48 | "730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;" + 49 | "782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;" + 50 | "836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;" + 51 | 
"906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;" + 52 | "952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;" + 53 | "999", // "faculty" in first 1000 documents of Stanford crawl 54 | "324;335;418;466;505;686", // "anthropology" in first 1000 documents of Stanford crawl 55 | "[3, 4, 9, 16, 19, 24, 25, 27, 28, 30, 31, 32, 33, 35, 36, 43, 46, 47, 52, 55, 57, 60, " + 56 | "61, 62, 64, 65, 66, 77, 78, 80, 83, 86, 91, 98, 99, 100, 101, 102, 103, 104, " + 57 | "106, 108, 112, 113, 116, 117, 119, 120, 127, 141, 147, 151, 156, 158, 168, 170, " + 58 | "172, 175, 179, 182, 184, 185, 187, 195, 197, 199, 202, 206, 207, 208, 209, 210, " + 59 | "213, 221, 225, 227, 228, 233, 238, 249, 252, 255, 256, 266, 267, 268, 270, 271, " + 60 | "273, 274, 281, 284, 285, 289, 290, 292, 294, 299, 301, 302, 303, 306, 308, 312, " + 61 | "320, 321, 322, 324, 325, 326, 328, 329, 332, 334, 335, 337, 341, 342, 344, 345, " + 62 | "347, 349, 356, 357, 358, 360, 364, 376, 377, 379, 382, 383, 385, 395, 397, 403, " + 63 | "404, 405, 406, 410, 412, 417, 418, 423, 430, 431, 432, 433, 434, 437, 440, 441, " + 64 | "445, 446, 452, 453, 454, 461, 464, 466, 469, 477, 480, 486, 487, 488, 495, 496, " + 65 | "505, 506, 507, 511, 512, 517, 518, 520, 522, 524, 526, 532, 535, 540, 543, 549, " + 66 | "550, 558, 562, 563, 564, 571, 574, 581, 586, 587, 592, 597, 598, 604, 607, 608, " + 67 | "615, 620, 621, 622, 625, 633, 634, 635, 636, 639, 640, 642, 653, 654, 656, 658, " + 68 | "660, 668, 671, 676, 680, 681, 683, 686, 687, 689, 694, 697, 702, 703, 708, 710, " + 69 | "711, 714, 722, 723, 729, 730, 737, 739, 742, 746, 747, 750, 756, 757, 758, 759, " + 70 | "764, 765, 766, 769, 770, 772, 777, 780, 782, 783, 784, 791, 795, 798, 801, 807, " + 71 | "812, 815, 816, 822, 823, 824, 825, 828, 830, 833, 835, 836, 837, 841, 852, 854, " + 72 | "863, 864, 865, 868, 870, 873, 880, 882, 884, 887, 888, 889, 897, 902, 906, 912, " + 73 | "914, 918, 922, 924, 925, 928, 929, 932, 933, 
934, 938, 939, 941, 943, 944, 947, " +
74 |       "948, 952, 955, 961, 962, 963, 968, 971, 973, 975, 979, 980, 983, 984, 987, 989, " +
75 |       "993, 995, 996, 999]" },
76 |   };
77 | 
78 |   /** Stores the Posting for a single document: a docID and optionally a list of document positions. */
79 |   private static class Posting {
80 |     final int docID;
81 |     final List<Integer> positions;
82 |     public Posting(int docID, List<Integer> positions) {
83 |       this.docID = docID;
84 |       this.positions = positions;
85 |     }
86 |     public Iterator<Integer> positions() { return positions.iterator(); }
87 |     public String toString() {
88 |       return docID + ":" + positions;
89 |     }
90 |   }
91 | 
92 |   /** Returns the next item from the Iterator, or null if it is exhausted.
93 |    *  (This is a more C-like method than idiomatic Java, but we use it so as
94 |    *  to be more parallel to the pseudo-code in the textbook.)
95 |    */
96 |   static <X> X popNextOrNull(Iterator<X> p) {
97 |     if (p.hasNext()) {
98 |       return p.next();
99 |     } else {
100 |       return null;
101 |     }
102 |   }
103 | 
104 |   static List<Integer> merge(Iterator<Posting> p1, Iterator<Posting> p2) {
105 |     List<Integer> answer = new ArrayList<>();
106 | 
107 |     Posting pp1 = popNextOrNull(p1);
108 |     Posting pp2 = popNextOrNull(p2);
109 | 
110 | 
111 |     // WRITE ALGORITHM HERE
112 | 
113 |     return answer;
114 |   }
115 | 
116 |   /** Load a single postings list: Information about where a single token
117 |    *  appears in documents in the collection. This can load either a document-
118 |    *  level postings list, which is a list of integer docIDs separated by semicolons,
119 |    *  or a positional postings list, where each docID is followed by a colon
120 |    *  and then a comma-separated list of positions.
121 |    *  @param postingsString A String representation of a postings list
122 |    *  @return An Iterator over a {@code List<Posting>}
123 |    */
124 |   static Iterator<Posting> loadPostingsList(String postingsString) {
125 |     List<Posting> postingsList = new ArrayList<>();
126 |     String[] postingsArray = postingsString.split(";");
127 |     for (String posting : postingsArray) {
128 |       String[] bits = posting.split(":");
129 |       String[] poses = {};
130 |       if (bits.length > 1) {
131 |         poses = bits[1].split(",");
132 |       }
133 |       int docID = Integer.valueOf(bits[0].trim());
134 |       List<Integer> positions = new ArrayList<>();
135 |       for (String pos : poses) {
136 |         positions.add(Integer.valueOf(pos.trim()));
137 |       }
138 |       Posting post = new Posting(docID, positions);
139 |       postingsList.add(post);
140 |     }
141 |     if (DEBUG) {
142 |       System.err.println("Loaded postings list: " + postingsList);
143 |     }
144 |     return postingsList.iterator();
145 |   }
146 | 
147 |   /** Main method. With no parameters, it runs some internal test cases.
148 |    *  With two postings list arguments, it ORs the arguments given on the command line.
149 |    *  Otherwise, it will print a usage message.
150 |    *
151 |    *  @param args Command-line arguments, as above.
152 |    */
153 |   public static void main(String[] args) {
154 |     if (args.length == 0) {
155 |       for (String[] test : mergeTestCases) {
156 |         Iterator<Posting> pl1 = loadPostingsList(test[0]);
157 |         Iterator<Posting> pl2 = loadPostingsList(test[1]);
158 |         System.out.println("Merge of " + test[0]);
159 |         System.out.println(" and " + test[1] + ": ");
160 |         List<Integer> ans = merge(pl1, pl2);
161 |         System.out.println("Answer: " + ans);
162 |         if ( ! 
ans.toString().equals(test[2])) { 163 | System.out.println("Correct: " + test[2]); 164 | System.out.println("*** ERROR ***"); 165 | } 166 | System.out.println(); 167 | } 168 | } else if (args.length != 2) { 169 | System.err.println("Usage: java Or postingsList1 postingsList2"); 170 | System.err.println(" postingsList format(s): '1:17,25; 4:17,191,291,430,434; 5:14,19,10'"); 171 | System.err.println(" or: '1; 4; 5'"); 172 | } else { 173 | Iterator pl1 = loadPostingsList(args[0]); 174 | Iterator pl2 = loadPostingsList(args[1]); 175 | List ans = merge(pl1, pl2); 176 | System.out.println(ans); 177 | } 178 | } 179 | 180 | } 181 | -------------------------------------------------------------------------------- /src/PositionalWithinK.java: -------------------------------------------------------------------------------- 1 | import java.util.ArrayList; 2 | import java.util.Iterator; 3 | import java.util.List; 4 | 5 | /** 6 | * @author Christopher Manning 7 | */ 8 | public class PositionalWithinK { 9 | 10 | static boolean DEBUG = false; 11 | 12 | static final String[][] intersectTestCases = { 13 | { "1:7,18,33,72,86,231; 2:1,17,74,222,255; 4:8,16,190,429,433; 5:363,367; 7:13,23,191; 13:28", 14 | "1:17,25; 4:17,191,291,430,434; 5:14,19,101; 6:19; 8:42; 10:11; 13:24", 15 | "[(1,18,17), (4,16,17), (4,190,191), (4,429,430), (4,429,434), (4,433,430), (4,433,434), (13,28,24)]" }, 16 | { "1:11,35,77,98,104; 5:100", 17 | "1:21,92,93,94,95,97,99,100,101,102,103,105,106,107,108,109,110; 5:94,95", 18 | "[(1,98,93), (1,98,94), (1,98,95), (1,98,97), (1,98,99), (1,98,100), (1,98,101), (1,98,102), (1,98,103), (1,104,99), (1,104,100), (1,104,101), (1,104,102), (1,104,103), (1,104,105), (1,104,106), (1,104,107), (1,104,108), (1,104,109), (5,100,95)]" }, 19 | { "1:1,2,3,4,5,6,7", 20 | "1:1,2,3,4,5,6,7", 21 | "[(1,1,1), (1,1,2), (1,1,3), (1,1,4), (1,1,5), (1,1,6), (1,2,1), (1,2,2), (1,2,3), (1,2,4), (1,2,5), (1,2,6), (1,2,7), (1,3,1), (1,3,2), (1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7), 
(1,4,1), (1,4,2), (1,4,3), (1,4,4), (1,4,5), (1,4,6), (1,4,7), (1,5,1), (1,5,2), (1,5,3), (1,5,4), (1,5,5), (1,5,6), (1,5,7), (1,6,1), (1,6,2), (1,6,3), (1,6,4), (1,6,5), (1,6,6), (1,6,7), (1,7,2), (1,7,3), (1,7,4), (1,7,5), (1,7,6), (1,7,7)]" },
    { "1:11,92; 17:6,16; 21:103,113,114",
      "4:8; 5:2; 17:11; 21:3, 97,108",
      "[(17,6,11), (17,16,11), (21,103,108), (21,113,108)]" },
    { "5:4; 11:7,18; 12:1,17; 14:8,16; 15:363,367; 7:13,23,191; 103:28",
      "3:2; 8:9; 11:17,25; 14:17,434; 15:101; 16:19; 18:42; 100:11; 103:24; 109:11",
      "[(11,18,17), (14,16,17), (103,28,24)]" },
    { "1:1; 5:1; 11:1; 13:1; 19:1; 43:1",
      "2:1; 3:1; 5:1; 9:1; 11:1; 15:1; 19:1; 33:1; 45:1",
      "[(5,1,1), (11,1,1), (19,1,1)]" },
    { "1:1",
      "2:1; 189:10",
      "[]" },
  };


  /** Stores the Posting for a single document: a docID and optionally a list of document positions. */
  private static class Posting {
    final int docID;
    final List<Integer> positions;
    public Posting(int docID, List<Integer> positions) {
      this.docID = docID;
      this.positions = positions;
    }
    public Iterator<Integer> positions() { return positions.iterator(); }
    public String toString() {
      return docID + ":" + positions;
    }
  }


  /** Stores a single positional answer: a triple of a document ID and the token positions
   *  of the matching words in the first and second postings lists.
   */
  static class AnswerElement {
    final int docID;
    final int p1pos;
    final int p2pos;

    AnswerElement(int docID, int p1pos, int p2pos) {
      this.docID = docID;
      this.p1pos = p1pos;
      this.p2pos = p2pos;
    }

    public String toString() {
      return "(" + docID + "," + p1pos + "," + p2pos + ")";
    }
  }


  /** Returns the next item from the Iterator, or null if it is exhausted.
   *  (This is a more C-like method than idiomatic Java, but we use it so as
   *  to be more parallel to the pseudo-code in the textbook.)
   */
  static <X> X popNextOrNull(Iterator<X> p) {
    if (p.hasNext()) {
      return p.next();
    } else {
      return null;
    }
  }


  /** Find proximity matches where the two words are within k words in the two postings lists.
   *  Returns a List of (document, position_of_p1_word, position_of_p2_word) items.
   */
  static List<AnswerElement> positionalIntersect(Iterator<Posting> p1, Iterator<Posting> p2, int k) {
    List<AnswerElement> answer = new ArrayList<>();
    Posting p1posting = popNextOrNull(p1);
    Posting p2posting = popNextOrNull(p2);

    // WRITE THE ALGORITHM HERE!

    return answer;
  }


  /** Load a single postings list: Information about where a single token
   *  appears in documents in the collection. This can load either a document-level
   *  postings list, which is a list of integer docIDs separated by semicolons,
   *  or a positional postings list, where each docID is followed by a colon
   *  and then a comma-separated list of the token's positions in that document.
   *
   *  @param postingsString A String representation of a postings list
   *  @return An Iterator over a {@code List<Posting>}
   */
  static Iterator<Posting> loadPostingsList(String postingsString) {
    List<Posting> postingsList = new ArrayList<>();
    String[] postingsArray = postingsString.split(";");
    for (String posting : postingsArray) {
      // Each posting is "docID" or "docID:pos1,pos2,..."
      String[] bits = posting.split(":");
      String[] poses = {};
      if (bits.length > 1) {
        poses = bits[1].split(",");
      }
      int docID = Integer.valueOf(bits[0].trim());
      List<Integer> positions = new ArrayList<>();
      for (String pos : poses) {
        positions.add(Integer.valueOf(pos.trim()));
      }
      Posting post = new Posting(docID, positions);
      postingsList.add(post);
    }
    if (DEBUG) {
      System.err.println("Loaded postings list: " + postingsList);
    }
    return postingsList.iterator();
  }


  /** Main method. With no parameters, it runs the internal test cases with k = 5.
   *  With two postings list arguments, it runs the within-5 positional intersection on them.
   *  Otherwise, it prints a usage message.
   */
  public static void main(String[] args) {
    if (args.length == 0) {
      for (String[] test : intersectTestCases) {
        Iterator<Posting> pl1 = loadPostingsList(test[0]);
        Iterator<Posting> pl2 = loadPostingsList(test[1]);
        System.out.println("Intersection of " + test[0]);
        System.out.println(" and " + test[1] + ": ");
        List<AnswerElement> ans = positionalIntersect(pl1, pl2, 5);
        System.out.println("Answer: " + ans);
        if ( ! ans.toString().equals(test[2])) {
          System.out.println("Should be: " + test[2]);
          System.out.println("*** ERROR ***");
        }
        System.out.println();
      }
    } else if (args.length != 2) {
      System.err.println("Usage: java PositionalWithinK postingsList1 postingsList2");
      System.err.println(" postingsList format: '1:17,25; 4:17,191,291,430,434; 5:14,19,10'");
    } else {
      Iterator<Posting> pl1 = loadPostingsList(args[0]);
      Iterator<Posting> pl2 = loadPostingsList(args[1]);
      List<AnswerElement> ans = positionalIntersect(pl1, pl2, 5);
      System.out.println(ans);
    }
  }

}
--------------------------------------------------------------------------------
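For readers who want to check their own solution, the following is a minimal sketch of one way the body of `positionalIntersect` could be filled in. It is not the reference solution. It assumes that docIDs and the positions inside each posting are sorted in ascending order, and it treats "within k" as `|pos1 - pos2| <= k`, which is what the expected answers in the test cases suggest. The class name `WithinKSketch`, its stripped-down `Posting`, and the choice to return answers as plain strings are assumptions made only to keep the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the exercise code.
class WithinKSketch {

  /** Simplified stand-in for the exercise's Posting class. */
  static final class Posting {
    final int docID;
    final List<Integer> positions;  // assumed sorted ascending
    Posting(int docID, List<Integer> positions) {
      this.docID = docID;
      this.positions = positions;
    }
  }

  static <X> X popNextOrNull(Iterator<X> p) {
    return p.hasNext() ? p.next() : null;
  }

  /** Returns "(docID,pos1,pos2)" strings for all position pairs at most k apart. */
  static List<String> positionalIntersect(Iterator<Posting> p1, Iterator<Posting> p2, int k) {
    List<String> answer = new ArrayList<>();
    Posting a = popNextOrNull(p1);
    Posting b = popNextOrNull(p2);
    while (a != null && b != null) {
      if (a.docID == b.docID) {
        // Window of second-list positions that may still be within k of pos1.
        Deque<Integer> window = new ArrayDeque<>();
        Iterator<Integer> pp2 = b.positions.iterator();
        Integer pos2 = popNextOrNull(pp2);
        for (int pos1 : a.positions) {
          // Pull in second-list positions up to pos1 + k.
          while (pos2 != null && pos2 <= pos1 + k) {
            window.addLast(pos2);
            pos2 = popNextOrNull(pp2);
          }
          // Drop positions that are now more than k behind pos1.
          while (!window.isEmpty() && window.peekFirst() < pos1 - k) {
            window.removeFirst();
          }
          for (int w : window) {
            answer.add("(" + a.docID + "," + pos1 + "," + w + ")");
          }
        }
        a = popNextOrNull(p1);
        b = popNextOrNull(p2);
      } else if (a.docID < b.docID) {
        a = popNextOrNull(p1);  // advance the list with the smaller docID
      } else {
        b = popNextOrNull(p2);
      }
    }
    return answer;
  }

  public static void main(String[] args) {
    List<Posting> l1 = Arrays.asList(
        new Posting(1, Arrays.asList(7, 18, 33)),
        new Posting(4, Arrays.asList(8, 16)));
    List<Posting> l2 = Arrays.asList(
        new Posting(1, Arrays.asList(17, 25)),
        new Posting(4, Arrays.asList(17)));
    // prints [(1,18,17), (4,16,17)]
    System.out.println(positionalIntersect(l1.iterator(), l2.iterator(), 5));
  }
}
```

Each postings list and each position list is read once, front to back, which is the streaming property the exercise is after. The window only ever holds second-list positions within k of the current first-list position, so memory use is bounded by the densest k-neighborhood rather than by the length of the lists.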