├── .classpath
├── .project
├── README.md
├── build.xml
└── src
    ├── AndNot.java
    ├── Intersect.java
    ├── Or.java
    └── PositionalWithinK.java
/.classpath:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
--------------------------------------------------------------------------------
/.project:
--------------------------------------------------------------------------------
1 |
2 |
3 | Merge Algorithms
4 |
5 |
6 |
7 |
8 |
9 | org.eclipse.jdt.core.javabuilder
10 |
11 |
12 |
13 |
14 |
15 | org.eclipse.jdt.core.javanature
16 |
17 |
18 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CS276 merge algorithm exercises
2 |
3 | Whenever we are performing a query with more than one word, we have to perform
4 | some kind of combination of the results for documents that do or do not
5 | contain each word.
6 |
7 | This exercise works through some of the cases that you should understand.
8 |
9 | In the first class, we spent almost all the time talking about Google....
10 | A large search engine has thousands of machines and can keep a ton of stuff
11 | in memory. But we should also consider small search systems, such as Spotlight
12 | search on your Mac or the equivalent Windows Search for Windows.
13 |
14 | **1.**
15 | Do standard Linux distributions provide a search engine for your computer?
16 | Your Linux geek friend says that it's easy and that if you want to search for
17 | files with "apple" and "computer" in them, all you need to do is type:
18 | ```bash
19 | comm -12 <(grep -Riwl "apple" /) <(grep -Riwl "computer" /)
20 | ```
21 | Is that a good solution?
22 |
23 | If we want to provide a good text search system on a small machine:
24 | 1. For speed, we need to index.
25 | 2. For a good trade-off between memory use and time efficiency, we probably keep the dictionary in memory
26 | but we need to keep all the postings lists on disk.
27 | 3. For acceptable speed and memory use when working with postings lists on disk, we need to have an algorithm
28 | that streams postings from disk and just iterates once forward through the
29 | postings list. A hard disk does not support efficient random access.
30 | Algorithms that do this for two or more lists simultaneously are referred to
31 | as "merge algorithms". The name is maybe misleading, since we sometimes, e.g.,
32 | intersect rather than merging lists, but the name is traditional.
33 | 4. The secret to such algorithms being efficient is consistently *sorting* postings lists.
34 | In this exercise, our postings lists are actually built in memory for simplicity,
35 | but we want to write algorithms that support this model of efficiently streaming postings lists
36 | from disk.
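
The streaming model in point 3 can be sketched as follows. This is a minimal, hypothetical illustration and not part of the exercise code: the class name `StreamingPostings` is invented here, and it assumes document-level postings (semicolon-separated docIDs, no positions) arriving from any `java.io.Reader`. It keeps only a single look-ahead docID, so its memory use does not grow with the length of the postings list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch: yields docIDs one at a time from a semicolon-separated source. */
class StreamingPostings implements Iterator<Integer> {
    private final Reader in;
    private Integer lookahead; // next docID to hand out, or null once the source is exhausted

    StreamingPostings(Reader source) {
        this.in = new BufferedReader(source);
        this.lookahead = readNext();
    }

    /** Reads characters up to the next ';' (or end of input) and parses them as a docID. */
    private Integer readNext() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (c != ';') {
                    sb.append((char) c);
                } else if (sb.toString().trim().isEmpty()) {
                    sb.setLength(0); // skip an empty entry such as ";;"
                } else {
                    break;
                }
            }
            String token = sb.toString().trim();
            return token.isEmpty() ? null : Integer.valueOf(token);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return lookahead != null;
    }

    @Override
    public Integer next() {
        if (lookahead == null) {
            throw new NoSuchElementException();
        }
        Integer result = lookahead;
        lookahead = readNext(); // only ever moves forward through the source
        return result;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a file here; a FileReader would work the same way.
        Iterator<Integer> postings = new StreamingPostings(new StringReader("1; 4; 5; 13"));
        while (postings.hasNext()) {
            System.out.println(postings.next());
        }
    }
}
```

Because the class implements `Iterator`, it offers the same forward-only interface that the skeleton code in `src` consumes, which is the property a merge algorithm needs.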
37 |
38 | ### Document-level indices
39 |
40 | **2.**
41 | Let's first write a simple routine to do an "AND" query – we intersect two postings lists.
42 | We've provided in `Intersect.java` (in `src`) a skeleton for some code that loads postings
43 | lists and tries to intersect pairs of them. Postings lists are semicolon-separated lists
44 | of document IDs, which you can pass in on the command-line or you can use our test cases.
45 | To get this code in Eclipse, do `File|Import` choose `Git|Projects from Git`, press `Next`,
46 | `Clone URI` then `Next`, then enter the HTTPS URI on this page, and `Next`, `Next`, `Next`,
47 | go with `Import existing projects`, `Next` and `Finish` – and you should be all ready to go!
48 | (The code uses the Java 7 diamond operator – if you last used Java in CS106A, then you do
49 | need to update to a more modern version of Eclipse.)
50 |
51 | Here's one potential solution:
52 |
53 | ```java
54 | static List<Posting> listFromIterator(Iterator<Posting> iter) {
55 |   List<Posting> list = new ArrayList<>();
56 |   Posting p;
57 |   while ((p = popNextOrNull(iter)) != null) {
58 |     list.add(p);
59 |   }
60 |   return list;
61 | }
62 |
63 | static List<Integer> intersect(Iterator<Posting> p1, Iterator<Posting> p2) {
64 |   List<Integer> answer = new ArrayList<>();
65 |
66 |   List<Posting> lp1 = listFromIterator(p1);
67 |   List<Posting> lp2 = listFromIterator(p2);
68 |   for (Posting posting1 : lp1) {
69 |     for (Posting posting2 : lp2) {
70 |       if (posting1.docID == posting2.docID) {
71 |         answer.add(posting1.docID);
72 |       }
73 |     }
74 |   }
75 |
76 |   return answer;
77 | }
78 | ```
79 |
80 | Is it a good solution in terms of the criteria above? Why or why not?
81 |
82 | **3.**
83 | Let's now write a solution using a merge algorithm.
84 | It'd be by far the best if you can just write your own merge algorithm for
85 | postings list intersection from first principles!
86 | However, if you can't remember what that's about, you could look at Figure 1.6
87 | of the textbook. Check that your solution works on our test cases.
88 |
89 | **4.**
90 | Suppose we then wanted to do an "OR" algorithm to more truly "merge" postings lists.
91 | How would you modify your code in `Intersect.java` to do an "OR"?
92 | Try it out in `Or.java`.
93 |
94 | *~~~ If you have extra time before we move to positional indices, you can also do __7__ and __8__. ~~~*
95 |
96 | ### Positional indices
97 |
98 | It's pretty standard these days that an IR system can efficiently answer not only queries for documents that
99 | contain multiple words but also queries requiring that those words occur close together. The simplest case is
100 | phrase queries, where we require the words to be adjacent and in order, like ["machine learning"].
101 | A more complex form is "WITHIN k" queries, which we write "/k". For example,
102 | the query [student /3 drunk] would match a document saying either "drunk student" or
103 | "a student who is drunk" but not "the student said that the faculty member was drunk".
104 | Note that the algorithm should return **all** matches in the document, so that if document 7 is
105 | "a drunk student who is drunk", then the query [drunk /3 student] should return two matches for
106 | this document: (7, 2, 3) and (7, 6, 3).
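
To make the match semantics concrete, here is a small brute-force sketch that reproduces the document 7 example. It is only a reference for what counts as a match, not the single-pass merge the exercise asks for: it compares every pair of positions, so it is quadratic and holds both position lists in memory. The class and method names are invented for this illustration, and it assumes whitespace tokenization with positions numbered from 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of WITHIN k match semantics; deliberately not a streaming merge. */
class WithinKExample {

    /** Returns the 1-based positions of term in a whitespace-tokenized document. */
    static List<Integer> positionsOf(String term, String document) {
        List<Integer> positions = new ArrayList<>();
        String[] tokens = document.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                positions.add(i + 1);
            }
        }
        return positions;
    }

    /** Returns every (docID,pos1,pos2) triple whose positions are at most k apart, in either order. */
    static List<String> matchesWithinK(int docID, List<Integer> pos1, List<Integer> pos2, int k) {
        List<String> matches = new ArrayList<>();
        for (int a : pos1) {
            for (int b : pos2) {
                if (Math.abs(a - b) <= k) {
                    matches.add("(" + docID + "," + a + "," + b + ")");
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String doc7 = "a drunk student who is drunk";
        List<Integer> drunk = positionsOf("drunk", doc7);     // positions 2 and 6
        List<Integer> student = positionsOf("student", doc7); // position 3
        // Query [drunk /3 student] on document 7
        System.out.println(matchesWithinK(7, drunk, student, 3)); // prints [(7,2,3), (7,6,3)]
    }
}
```

A checker like this can be handy for generating expected answers on small inputs while developing a streaming solution.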
107 |
108 | **5.**
109 | We will augment our postings lists with the positions of each token within each document,
110 | numbering them as token 1, token 2, etc. After each document ID, there will now be a colon
111 | and then a comma-separated list of positions. How can we extend our postings list merge
112 | algorithms to work with positional postings lists? Try to work out an algorithm that will do that.
113 | We've provided a skeleton in the file `PositionalWithinK.java`. Try to write something that
114 | passes the test cases provided there. (**Note:** There is some code in Introduction to Information Retrieval
115 | for this algorithm, but we're _really_ wanting you to try to write it by yourself. This education thing is
116 | all about _learning_, right? Pretend that you're practicing for your Google coding interview!)
117 |
118 | **6.**
119 | Our document-level merge algorithm had some important properties. It worked using a single forward pass
120 | through each postings list, so it was suitable for postings lists streamed from disk, and it could be both time efficient
121 | (linear in the length of the postings lists) and space efficient (the memory required does not depend on the size of the postings
122 | lists, since in the streaming scenario you can just read and refill a sliding buffer over the postings list as needed,
123 | like standard buffered IO). (As implemented in our sample code, space need only grow with the size of the set of matches.)
124 | Can these properties be maintained for doing a WITHIN k merge? We think they can! If that's not true of your solution,
125 | try to rewrite it so that: (i) The algorithm makes a single always-forward pass through each postings list and (ii)
126 | The memory required does not depend on the size of the input postings list, not even the size of the postings list
127 | for a single document (this is useful – some documents are very long!).
128 |
129 |
130 | *~~~ If you have extra time ~~~*
131 |
132 | **7.**
133 | Most search engines, including Google, support a negation or "NOT" operation.
134 | For instance, search on Google for [space]. (That is, search for the stuff inside the square brackets: the actual
135 | word "space", as in dark and cold.)
136 | For negation, you precede a word with "-". Try searching on Google for [space -astronomy]. See how
137 | the results change. According to boolean logic, what should happen if you just search for a negation like [-astronomy]? What
138 | does happen? Is there a good reason why things might work the way they do?
139 |
140 | **8.**
141 | Can we write an efficient merge algorithm for "AND NOT" queries?
142 | Try it out in `AndNot.java`.
143 |
--------------------------------------------------------------------------------
/build.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 |
48 |
54 |
55 |
57 |
58 |
59 |
60 |
61 |
62 |
68 |
69 |
70 |
71 |
72 |
73 |
74 |
75 |
76 |
81 |
82 |
83 |
84 |
87 |
88 |
89 |
90 |
91 |
92 |
93 |
94 |
95 |
96 |
104 |
105 |
107 |
108 |
109 |
110 |
119 |
120 |
121 |
122 |
123 |
124 |
125 |
126 |
131 |
132 |
133 |
134 |
135 |
136 |
140 |
141 |
142 |
143 |
144 |
145 |
146 |
147 |
148 |
149 |
150 |
151 |
152 |
153 |
154 |
155 |
163 |
164 |
166 |
167 |
168 |
174 |
175 |
176 |
177 |
178 |
179 |
180 |
181 |
182 |
192 |
193 |
194 |
195 |
196 |
197 |
198 |
199 |
200 |
201 |
--------------------------------------------------------------------------------
/src/AndNot.java:
--------------------------------------------------------------------------------
1 | import java.util.ArrayList;
2 | import java.util.Iterator;
3 | import java.util.List;
4 |
5 | /**
6 | * @author Christopher Manning
7 | */
8 | public class AndNot {
9 |
10 | /** If true, loaded postings lists are printed; this just shows that they were loaded correctly. */
11 | static boolean DEBUG = false;
12 |
13 | /** Test cases */
14 | static final String[][] mergeTestCases = {
15 | { "1; 2; 4; 5; 7; 13",
16 | "1; 4; 5; 6; 8; 10; 13",
17 | "[2, 7]" },
18 | { "1; 5",
19 | "1; 5",
20 | "[]" },
21 | { "1",
22 | "1",
23 | "[]" },
24 | { "1; 17; 21",
25 | "4; 5; 17; 21; 97; 108",
26 | "[1]" },
27 | { "5; 11; 12; 14; 15; 103",
28 | "3; 8; 11; 14; 15; 16; 18; 100; 103; 109",
29 | "[5, 12]" },
30 | { "1; 5; 11; 13; 19; 43",
31 | "2; 3; 5; 9; 11; 15; 19; 33; 45",
32 | "[1, 13, 43]" },
33 | { "1",
34 | "2; 189",
35 | "[1]" },
36 | { "3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;" +
37 | "64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;" +
38 | "117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;" +
39 | "197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;" +
40 | "266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;" +
41 | "308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;" +
42 | "356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;" +
43 | "417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;" +
44 | "469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;" +
45 | "535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;" +
46 | "608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;" +
47 | "671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;" +
48 | "730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;" +
49 | "782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;" +
50 | "836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;" +
51 | "906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;" +
52 | "952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;" +
53 | "999", // "faculty" in first 1000 documents of Stanford crawl
54 | "324;335;418;466;505;686", // "anthropology" in first 1000 documents of Stanford crawl
55 | "[3, 4, 9, 16, 19, 24, 25, 27, 28, 30, 31, 32, 33, 35, 36, 43, 46, 47, 52, 55, 57, 60, " +
56 | "61, 62, 64, 65, 66, 77, 78, 80, 83, 86, 91, 98, 99, 100, 101, 102, 103, 104, " +
57 | "106, 108, 112, 113, 116, 117, 119, 120, 127, 141, 147, 151, 156, 158, 168, 170, " +
58 | "172, 175, 179, 182, 184, 185, 187, 195, 197, 199, 202, 206, 207, 208, 209, 210, " +
59 | "213, 221, 225, 227, 228, 233, 238, 249, 252, 255, 256, 266, 267, 268, 270, 271, " +
60 | "273, 274, 281, 284, 285, 289, 290, 292, 294, 299, 301, 302, 303, 306, 308, 312, " +
61 | "320, 321, 322, 325, 326, 328, 329, 332, 334, 337, 341, 342, 344, 345, " +
62 | "347, 349, 356, 357, 358, 360, 364, 376, 377, 379, 382, 383, 385, 395, 397, 403, " +
63 | "404, 405, 406, 410, 412, 417, 423, 430, 431, 432, 433, 434, 437, 440, 441, " +
64 | "445, 446, 452, 453, 454, 461, 464, 469, 477, 480, 486, 487, 488, 495, 496, " +
65 | "506, 507, 511, 512, 517, 518, 520, 522, 524, 526, 532, 535, 540, 543, 549, " +
66 | "550, 558, 562, 563, 564, 571, 574, 581, 586, 587, 592, 597, 598, 604, 607, 608, " +
67 | "615, 620, 621, 622, 625, 633, 634, 635, 636, 639, 640, 642, 653, 654, 656, 658, " +
68 | "660, 668, 671, 676, 680, 681, 683, 687, 689, 694, 697, 702, 703, 708, 710, " +
69 | "711, 714, 722, 723, 729, 730, 737, 739, 742, 746, 747, 750, 756, 757, 758, 759, " +
70 | "764, 765, 766, 769, 770, 772, 777, 780, 782, 783, 784, 791, 795, 798, 801, 807, " +
71 | "812, 815, 816, 822, 823, 824, 825, 828, 830, 833, 835, 836, 837, 841, 852, 854, " +
72 | "863, 864, 865, 868, 870, 873, 880, 882, 884, 887, 888, 889, 897, 902, 906, 912, " +
73 | "914, 918, 922, 924, 925, 928, 929, 932, 933, 934, 938, 939, 941, 943, 944, 947, " +
74 | "948, 952, 955, 961, 962, 963, 968, 971, 973, 975, 979, 980, 983, 984, 987, 989, " +
75 | "993, 995, 996, 999]" },
76 | };
77 |
78 | /** Stores the Posting for a single document: a docID and optionally a list of document positions. */
79 | private static class Posting {
80 | final int docID;
81 | final List<Integer> positions;
82 | public Posting(int docID, List<Integer> positions) {
83 | this.docID = docID;
84 | this.positions = positions;
85 | }
86 | public Iterator<Integer> positions() { return positions.iterator(); }
87 | public String toString() {
88 | return docID + ":" + positions;
89 | }
90 | }
91 |
92 | /** Returns the next item from the Iterator, or null if it is exhausted.
93 | * (This is a more C-like method than idiomatic Java, but we use it so as
94 | * to be more parallel to the pseudo-code in the textbook.)
95 | */
96 | static <X> X popNextOrNull(Iterator<X> p) {
97 | if (p.hasNext()) {
98 | return p.next();
99 | } else {
100 | return null;
101 | }
102 | }
103 |
104 | static List<Integer> merge(Iterator<Posting> p1, Iterator<Posting> p2) {
105 | List<Integer> answer = new ArrayList<>();
106 |
107 | Posting pp1 = popNextOrNull(p1);
108 | Posting pp2 = popNextOrNull(p2);
109 |
110 |
111 | // WRITE ALGORITHM HERE
112 |
113 | return answer;
114 | }
115 |
116 | /** Load a single postings list: Information about where a single token
117 | * appears in documents in the collection. This can load either a document
118 | * level posting which is a list of integer docID separated by semicolons
119 | * or a positional postings list, where each docID is followed by a colon
120 | * and then a comma-separated list of positions.
121 | * @param postingsString A String representation of a postings list
122 | * @return An Iterator over a {@code List<Posting>}
123 | */
124 | static Iterator<Posting> loadPostingsList(String postingsString) {
125 | List<Posting> postingsList = new ArrayList<>();
126 | String[] postingsArray = postingsString.split(";");
127 | for (String posting : postingsArray) {
128 | String[] bits = posting.split(":");
129 | String[] poses = {};
130 | if (bits.length > 1) {
131 | poses = bits[1].split(",");
132 | }
133 | int docID = Integer.valueOf(bits[0].trim());
134 | List<Integer> positions = new ArrayList<>();
135 | for (String pos : poses) {
136 | positions.add(Integer.valueOf(pos.trim()));
137 | }
138 | Posting post = new Posting(docID, positions);
139 | postingsList.add(post);
140 | }
141 | if (DEBUG) {
142 | System.err.println("Loaded postings list: " + postingsList);
143 | }
144 | return postingsList.iterator();
145 | }
146 |
147 | /** Main method. With no parameters, it runs some internal test cases.
148 | * With two postings list arguments, it computes the AND NOT (first list minus second) of the arguments given on the command line.
149 | * Otherwise, it will print a usage message.
150 | *
151 | * @param args Command-line arguments, as above.
152 | */
153 | public static void main(String[] args) {
154 | if (args.length == 0) {
155 | for (String[] test : mergeTestCases) {
156 | Iterator<Posting> pl1 = loadPostingsList(test[0]);
157 | Iterator<Posting> pl2 = loadPostingsList(test[1]);
158 | System.out.println("Merge of " + test[0]);
159 | System.out.println(" and " + test[1] + ": ");
160 | List<Integer> ans = merge(pl1, pl2);
161 | System.out.println("Answer: " + ans);
162 | if ( ! ans.toString().equals(test[2])) {
163 | System.out.println("Correct: " + test[2]);
164 | System.out.println("*** ERROR ***");
165 | }
166 | System.out.println();
167 | }
168 | } else if (args.length != 2) {
169 | System.err.println("Usage: java AndNot postingsList1 postingsList2");
170 | System.err.println(" postingsList format(s): '1:17,25; 4:17,191,291,430,434; 5:14,19,10'");
171 | System.err.println(" or: '1; 4; 5'");
172 | } else {
173 | Iterator<Posting> pl1 = loadPostingsList(args[0]);
174 | Iterator<Posting> pl2 = loadPostingsList(args[1]);
175 | List<Integer> ans = merge(pl1, pl2);
176 | System.out.println(ans);
177 | }
178 | }
179 |
180 | }
181 |
--------------------------------------------------------------------------------
/src/Intersect.java:
--------------------------------------------------------------------------------
1 | import java.util.ArrayList;
2 | import java.util.Iterator;
3 | import java.util.List;
4 |
5 | /**
6 | * @author Christopher Manning
7 | */
8 | public class Intersect {
9 |
10 | /** If true, loaded postings lists are printed; this just shows that they were loaded correctly. */
11 | static boolean DEBUG = false;
12 |
13 | /** Test cases */
14 | static final String[][] intersectTestCases = {
15 | { "1; 2; 4; 5; 7; 13",
16 | "1; 4; 5; 6; 8; 10; 13",
17 | "[1, 4, 5, 13]" },
18 | { "1; 5",
19 | "1; 5",
20 | "[1, 5]" },
21 | { "1:1,2,3,4,5,6,7",
22 | "1:1,2,3,4,5,6,7",
23 | "[1]" },
24 | { "1:11,92; 17:6,16; 21:103,113,114",
25 | "4:8; 5:2; 17:11; 21:3, 97,108",
26 | "[17, 21]" },
27 | { "5:4; 11:7,18; 12:1,17; 14:8,16; 15:363,367; 103:28",
28 | "3:2; 8:9; 11:17,25; 14:17,434; 15:101; 16:19; 18:42; 100:11; 103:24; 109:11",
29 | "[11, 14, 15, 103]" },
30 | { "1:1; 5:1; 11:1; 13:1; 19:1; 43:1",
31 | "2:1; 3:1; 5:1; 9:1; 11:1; 15:1; 19:1; 33:1; 45:1",
32 | "[5, 11, 19]" },
33 | { "1:1",
34 | "2:1; 189:10",
35 | "[]" },
36 | { "3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;" +
37 | "64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;" +
38 | "117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;" +
39 | "197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;" +
40 | "266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;" +
41 | "308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;" +
42 | "356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;" +
43 | "417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;" +
44 | "469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;" +
45 | "535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;" +
46 | "608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;" +
47 | "671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;" +
48 | "730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;" +
49 | "782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;" +
50 | "836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;" +
51 | "906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;" +
52 | "952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;" +
53 | "999", // "faculty" in first 1000 documents of Stanford crawl
54 | "324;335;418;466;505;686", // "anthropology" in first 1000 documents of Stanford crawl
55 | "[335, 418, 466, 686]" },
56 | };
57 |
58 | /** Stores the Posting for a single document: a docID and optionally a list of document positions. */
59 | private static class Posting {
60 | final int docID;
61 | final List<Integer> positions;
62 | public Posting(int docID, List<Integer> positions) {
63 | this.docID = docID;
64 | this.positions = positions;
65 | }
66 | public Iterator<Integer> positions() { return positions.iterator(); }
67 | public String toString() {
68 | return docID + ":" + positions;
69 | }
70 | }
71 |
72 | /** Returns the next item from the Iterator, or null if it is exhausted.
73 | * (This is a more C-like method than idiomatic Java, but we use it so as
74 | * to be more parallel to the pseudo-code in the textbook.)
75 | */
76 | static <X> X popNextOrNull(Iterator<X> p) {
77 | if (p.hasNext()) {
78 | return p.next();
79 | } else {
80 | return null;
81 | }
82 | }
83 |
84 | static List<Integer> intersect(Iterator<Posting> p1, Iterator<Posting> p2) {
85 | List<Integer> answer = new ArrayList<>();
86 |
87 | Posting pp1 = popNextOrNull(p1);
88 | Posting pp2 = popNextOrNull(p2);
89 |
90 | // WRITE ALGORITHM HERE
91 |
92 | return answer;
93 | }
94 |
95 | /** Load a single postings list: Information about where a single token
96 | * appears in documents in the collection. This can load either a document
97 | * level posting which is a list of integer docID separated by semicolons
98 | * or a positional postings list, where each docID is followed by a colon
99 | * and then a comma-separated list of positions.
100 | * @param postingsString A String representation of a postings list
101 | * @return An Iterator over a {@code List<Posting>}
102 | */
103 | static Iterator<Posting> loadPostingsList(String postingsString) {
104 | List<Posting> postingsList = new ArrayList<>();
105 | String[] postingsArray = postingsString.split(";");
106 | for (String posting : postingsArray) {
107 | String[] bits = posting.split(":");
108 | String[] poses = {};
109 | if (bits.length > 1) {
110 | poses = bits[1].split(",");
111 | }
112 | int docID = Integer.valueOf(bits[0].trim());
113 | List<Integer> positions = new ArrayList<>();
114 | for (String pos : poses) {
115 | positions.add(Integer.valueOf(pos.trim()));
116 | }
117 | Posting post = new Posting(docID, positions);
118 | postingsList.add(post);
119 | }
120 | if (DEBUG) {
121 | System.err.println("Loaded postings list: " + postingsList);
122 | }
123 | return postingsList.iterator();
124 | }
125 |
126 | /** Main method. With no parameters, it runs some internal test cases.
127 | * With two postings list arguments, it intersects the arguments given on the command line.
128 | * Otherwise, it will print a usage message.
129 | *
130 | * @param args Command-line arguments, as above.
131 | */
132 | public static void main(String[] args) {
133 | if (args.length == 0) {
134 | for (String[] test : intersectTestCases) {
135 | Iterator<Posting> pl1 = loadPostingsList(test[0]);
136 | Iterator<Posting> pl2 = loadPostingsList(test[1]);
137 | System.out.println("Intersection of " + test[0]);
138 | System.out.println(" and " + test[1] + ": ");
139 | List<Integer> ans = intersect(pl1, pl2);
140 | System.out.println("Answer: " + ans);
141 | if ( ! ans.toString().equals(test[2])) {
142 | System.out.println("Should be: " + test[2]);
143 | System.out.println("*** ERROR ***");
144 | }
145 | System.out.println();
146 | }
147 | } else if (args.length != 2) {
148 | System.err.println("Usage: java Intersect postingsList1 postingsList2");
149 | System.err.println(" postingsList format(s): '1:17,25; 4:17,191,291,430,434; 5:14,19,10'");
150 | System.err.println(" or: '1; 4; 5'");
151 | } else {
152 | Iterator<Posting> pl1 = loadPostingsList(args[0]);
153 | Iterator<Posting> pl2 = loadPostingsList(args[1]);
154 | List<Integer> ans = intersect(pl1, pl2);
155 | System.out.println(ans);
156 | }
157 | }
158 |
159 | }
160 |
--------------------------------------------------------------------------------
/src/Or.java:
--------------------------------------------------------------------------------
1 | import java.util.ArrayList;
2 | import java.util.Iterator;
3 | import java.util.List;
4 |
5 | /**
6 | * @author Christopher Manning
7 | */
8 | public class Or {
9 |
10 | /** If true, loaded postings lists are printed; this just shows that they were loaded correctly. */
11 | static boolean DEBUG = false;
12 |
13 | /** Test cases */
14 | static final String[][] mergeTestCases = {
15 | { "1; 2; 4; 5; 7; 13",
16 | "1; 4; 5; 6; 8; 10; 13",
17 | "[1, 2, 4, 5, 6, 7, 8, 10, 13]" },
18 | { "1; 5",
19 | "1; 5",
20 | "[1, 5]" },
21 | { "1",
22 | "1",
23 | "[1]" },
24 | { "1; 17; 21",
25 | "4; 5; 17; 21; 97; 108",
26 | "[1, 4, 5, 17, 21, 97, 108]" },
27 | { "5; 11; 12; 14; 15; 103",
28 | "3; 8; 11; 14; 15; 16; 18; 100; 103; 109",
29 | "[3, 5, 8, 11, 12, 14, 15, 16, 18, 100, 103, 109]" },
30 | { "1; 5; 11; 13; 19; 43",
31 | "2; 3; 5; 9; 11; 15; 19; 33; 45",
32 | "[1, 2, 3, 5, 9, 11, 13, 15, 19, 33, 43, 45]" },
33 | { "1",
34 | "2; 189",
35 | "[1, 2, 189]" },
36 | { "3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;" +
37 | "64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;" +
38 | "117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;" +
39 | "197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;" +
40 | "266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;" +
41 | "308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;" +
42 | "356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;" +
43 | "417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;" +
44 | "469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;" +
45 | "535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;" +
46 | "608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;" +
47 | "671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;" +
48 | "730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;" +
49 | "782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;" +
50 | "836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;" +
51 | "906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;" +
52 | "952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;" +
53 | "999", // "faculty" in first 1000 documents of Stanford crawl
54 | "324;335;418;466;505;686", // "anthropology" in first 1000 documents of Stanford crawl
55 | "[3, 4, 9, 16, 19, 24, 25, 27, 28, 30, 31, 32, 33, 35, 36, 43, 46, 47, 52, 55, 57, 60, " +
56 | "61, 62, 64, 65, 66, 77, 78, 80, 83, 86, 91, 98, 99, 100, 101, 102, 103, 104, " +
57 | "106, 108, 112, 113, 116, 117, 119, 120, 127, 141, 147, 151, 156, 158, 168, 170, " +
58 | "172, 175, 179, 182, 184, 185, 187, 195, 197, 199, 202, 206, 207, 208, 209, 210, " +
59 | "213, 221, 225, 227, 228, 233, 238, 249, 252, 255, 256, 266, 267, 268, 270, 271, " +
60 | "273, 274, 281, 284, 285, 289, 290, 292, 294, 299, 301, 302, 303, 306, 308, 312, " +
61 | "320, 321, 322, 324, 325, 326, 328, 329, 332, 334, 335, 337, 341, 342, 344, 345, " +
62 | "347, 349, 356, 357, 358, 360, 364, 376, 377, 379, 382, 383, 385, 395, 397, 403, " +
63 | "404, 405, 406, 410, 412, 417, 418, 423, 430, 431, 432, 433, 434, 437, 440, 441, " +
64 | "445, 446, 452, 453, 454, 461, 464, 466, 469, 477, 480, 486, 487, 488, 495, 496, " +
65 | "505, 506, 507, 511, 512, 517, 518, 520, 522, 524, 526, 532, 535, 540, 543, 549, " +
66 | "550, 558, 562, 563, 564, 571, 574, 581, 586, 587, 592, 597, 598, 604, 607, 608, " +
67 | "615, 620, 621, 622, 625, 633, 634, 635, 636, 639, 640, 642, 653, 654, 656, 658, " +
68 | "660, 668, 671, 676, 680, 681, 683, 686, 687, 689, 694, 697, 702, 703, 708, 710, " +
69 | "711, 714, 722, 723, 729, 730, 737, 739, 742, 746, 747, 750, 756, 757, 758, 759, " +
70 | "764, 765, 766, 769, 770, 772, 777, 780, 782, 783, 784, 791, 795, 798, 801, 807, " +
71 | "812, 815, 816, 822, 823, 824, 825, 828, 830, 833, 835, 836, 837, 841, 852, 854, " +
72 | "863, 864, 865, 868, 870, 873, 880, 882, 884, 887, 888, 889, 897, 902, 906, 912, " +
73 | "914, 918, 922, 924, 925, 928, 929, 932, 933, 934, 938, 939, 941, 943, 944, 947, " +
74 | "948, 952, 955, 961, 962, 963, 968, 971, 973, 975, 979, 980, 983, 984, 987, 989, " +
75 | "993, 995, 996, 999]" },
76 | };
77 |
78 | /** Stores the Posting for a single document: a docID and optionally a list of document positions. */
79 | private static class Posting {
80 | final int docID;
81 | final List<Integer> positions;
82 | public Posting(int docID, List<Integer> positions) {
83 | this.docID = docID;
84 | this.positions = positions;
85 | }
86 | public Iterator<Integer> positions() { return positions.iterator(); }
87 | public String toString() {
88 | return docID + ":" + positions;
89 | }
90 | }
91 |
92 | /** Returns the next item from the Iterator, or null if it is exhausted.
93 | * (This is a more C-like method than idiomatic Java, but we use it so as
94 | * to be more parallel to the pseudo-code in the textbook.)
95 | */
96 | static <X> X popNextOrNull(Iterator<X> p) {
97 | if (p.hasNext()) {
98 | return p.next();
99 | } else {
100 | return null;
101 | }
102 | }
103 |
104 | static List<Integer> merge(Iterator<Posting> p1, Iterator<Posting> p2) {
105 | List<Integer> answer = new ArrayList<>();
106 |
107 | Posting pp1 = popNextOrNull(p1);
108 | Posting pp2 = popNextOrNull(p2);
109 |
110 |
111 | // WRITE ALGORITHM HERE
112 |
113 | return answer;
114 | }
115 |
116 | /** Load a single postings list: Information about where a single token
117 | * appears in documents in the collection. This can load either a document
118 | * level posting which is a list of integer docID separated by semicolons
119 | * or a positional postings list, where each docID is followed by a colon
120 | * and then a comma-separated list of positions.
121 | * @param postingsString A String representation of a postings list
122 | * @return An Iterator over a {@code List<Posting>}
123 | */
124 | static Iterator<Posting> loadPostingsList(String postingsString) {
125 | List<Posting> postingsList = new ArrayList<>();
126 | String[] postingsArray = postingsString.split(";");
127 | for (String posting : postingsArray) {
128 | String[] bits = posting.split(":");
129 | String[] poses = {};
130 | if (bits.length > 1) {
131 | poses = bits[1].split(",");
132 | }
133 | int docID = Integer.valueOf(bits[0].trim());
134 | List<Integer> positions = new ArrayList<>();
135 | for (String pos : poses) {
136 | positions.add(Integer.valueOf(pos.trim()));
137 | }
138 | Posting post = new Posting(docID, positions);
139 | postingsList.add(post);
140 | }
141 | if (DEBUG) {
142 | System.err.println("Loaded postings list: " + postingsList);
143 | }
144 | return postingsList.iterator();
145 | }
146 |
147 | /** Main method. With no parameters, it runs some internal test cases.
148 | * With two postings list arguments, it or's the arguments given on the command line.
149 | * Otherwise, it will print a usage message.
150 | *
151 | * @param args Command-line arguments, as above.
152 | */
153 | public static void main(String[] args) {
154 | if (args.length == 0) {
155 | for (String[] test : mergeTestCases) {
156 | Iterator<Posting> pl1 = loadPostingsList(test[0]);
157 | Iterator<Posting> pl2 = loadPostingsList(test[1]);
158 | System.out.println("Merge of " + test[0]);
159 | System.out.println(" and " + test[1] + ": ");
160 | List<Integer> ans = merge(pl1, pl2);
161 | System.out.println("Answer: " + ans);
162 | if ( ! ans.toString().equals(test[2])) {
163 | System.out.println("Correct: " + test[2]);
164 | System.out.println("*** ERROR ***");
165 | }
166 | System.out.println();
167 | }
168 | } else if (args.length != 2) {
169 | System.err.println("Usage: java Or postingsList1 postingsList2");
170 | System.err.println(" postingsList format(s): '1:17,25; 4:17,191,291,430,434; 5:14,19,101'");
171 | System.err.println(" or: '1; 4; 5'");
172 | } else {
173 | Iterator<Posting> pl1 = loadPostingsList(args[0]);
174 | Iterator<Posting> pl2 = loadPostingsList(args[1]);
175 | List ans = merge(pl1, pl2);
176 | System.out.println(ans);
177 | }
178 | }
179 |
180 | }
181 |
--------------------------------------------------------------------------------
/src/PositionalWithinK.java:
--------------------------------------------------------------------------------
1 | import java.util.ArrayList;
2 | import java.util.Iterator;
3 | import java.util.List;
4 |
5 | /**
6 | * @author Christopher Manning
7 | */
8 | public class PositionalWithinK {
9 |
10 | static boolean DEBUG = false;
11 |
12 | static final String[][] intersectTestCases = {
13 | { "1:7,18,33,72,86,231; 2:1,17,74,222,255; 4:8,16,190,429,433; 5:363,367; 7:13,23,191; 13:28",
14 | "1:17,25; 4:17,191,291,430,434; 5:14,19,101; 6:19; 8:42; 10:11; 13:24",
15 | "[(1,18,17), (4,16,17), (4,190,191), (4,429,430), (4,429,434), (4,433,430), (4,433,434), (13,28,24)]" },
16 | { "1:11,35,77,98,104; 5:100",
17 | "1:21,92,93,94,95,97,99,100,101,102,103,105,106,107,108,109,110; 5:94,95",
18 | "[(1,98,93), (1,98,94), (1,98,95), (1,98,97), (1,98,99), (1,98,100), (1,98,101), (1,98,102), (1,98,103), (1,104,99), (1,104,100), (1,104,101), (1,104,102), (1,104,103), (1,104,105), (1,104,106), (1,104,107), (1,104,108), (1,104,109), (5,100,95)]" },
19 | { "1:1,2,3,4,5,6,7",
20 | "1:1,2,3,4,5,6,7",
21 | "[(1,1,1), (1,1,2), (1,1,3), (1,1,4), (1,1,5), (1,1,6), (1,2,1), (1,2,2), (1,2,3), (1,2,4), (1,2,5), (1,2,6), (1,2,7), (1,3,1), (1,3,2), (1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7), (1,4,1), (1,4,2), (1,4,3), (1,4,4), (1,4,5), (1,4,6), (1,4,7), (1,5,1), (1,5,2), (1,5,3), (1,5,4), (1,5,5), (1,5,6), (1,5,7), (1,6,1), (1,6,2), (1,6,3), (1,6,4), (1,6,5), (1,6,6), (1,6,7), (1,7,2), (1,7,3), (1,7,4), (1,7,5), (1,7,6), (1,7,7)]" },
22 | { "1:11,92; 17:6,16; 21:103,113,114",
23 | "4:8; 5:2; 17:11; 21:3, 97,108",
24 | "[(17,6,11), (17,16,11), (21,103,108), (21,113,108)]" },
25 | { "5:4; 11:7,18; 12:1,17; 14:8,16; 15:363,367; 17:13,23,191; 103:28",
26 | "3:2; 8:9; 11:17,25; 14:17,434; 15:101; 16:19; 18:42; 100:11; 103:24; 109:11",
27 | "[(11,18,17), (14,16,17), (103,28,24)]" },
28 | { "1:1; 5:1; 11:1; 13:1; 19:1; 43:1",
29 | "2:1; 3:1; 5:1; 9:1; 11:1; 15:1; 19:1; 33:1; 45:1",
30 | "[(5,1,1), (11,1,1), (19,1,1)]" },
31 | { "1:1",
32 | "2:1; 189:10",
33 | "[]" },
34 | };
35 |
36 |
37 | /** Stores the Posting for a single document: a docID and optionally a list of document positions. */
38 | private static class Posting {
39 | final int docID;
40 | final List<Integer> positions;
41 | public Posting(int docID, List<Integer> positions) {
42 | this.docID = docID;
43 | this.positions = positions;
44 | }
45 | public Iterator<Integer> positions() { return positions.iterator(); }
46 | public String toString() {
47 | return docID + ":" + positions;
48 | }
49 | }
50 |
51 |
52 | /** Stores a single positional answer: a triple of a document ID and the token positions
53 | * of the matching words in the first and second postings lists.
54 | */
55 | static class AnswerElement {
56 | final int docID;
57 | final int p1pos;
58 | final int p2pos;
59 |
60 | AnswerElement(int docID, int p1pos, int p2pos) {
61 | this.docID = docID;
62 | this.p1pos = p1pos;
63 | this.p2pos = p2pos;
64 | }
65 |
66 | public String toString() {
67 | return "(" + docID + "," + p1pos + "," + p2pos + ")";
68 | }
69 | }
70 |
71 |
72 | /** Returns the next item from the Iterator, or null if it is exhausted.
73 | * (This is a more C-like method than idiomatic Java, but we use it so as
74 | * to be more parallel to the pseudo-code in the textbook.)
75 | */
76 | static <X> X popNextOrNull(Iterator<X> p) {
77 | if (p.hasNext()) {
78 | return p.next();
79 | } else {
80 | return null;
81 | }
82 | }
83 |
84 |
85 | /** Find proximity matches: documents where the two words occur within k token positions of each other, in either order.
86 | * Returns a List of (docID, position_of_p1_word, position_of_p2_word) items.
87 | */
88 | static List<AnswerElement> positionalIntersect(Iterator<Posting> p1, Iterator<Posting> p2, int k) {
89 | List<AnswerElement> answer = new ArrayList<>();
90 | Posting p1posting = popNextOrNull(p1);
91 | Posting p2posting = popNextOrNull(p2);
92 |
93 | // WRITE THE ALGORITHM HERE!
94 |
95 | return answer;
96 | }
97 |
98 |
99 | /** Load a single postings list: information about where a single token
100 | * appears in documents in the collection. This can load either a document
101 | * level postings list, which is a list of integer docIDs separated by semicolons,
102 | * or a positional postings list, where each docID is followed by a colon
103 | * and then a comma-separated list of the token positions in that document.
104 | * @param postingsString A String representation of a postings list
105 | * @return An Iterator over a {@code List<Posting>}
106 | */
107 | static Iterator<Posting> loadPostingsList(String postingsString) {
108 | List<Posting> postingsList = new ArrayList<>();
109 | String[] postingsArray = postingsString.split(";");
110 | for (String posting : postingsArray) {
111 | String[] bits = posting.split(":");
112 | String[] poses = {};
113 | if (bits.length > 1) {
114 | poses = bits[1].split(",");
115 | }
116 | int docID = Integer.parseInt(bits[0].trim());
117 | List<Integer> positions = new ArrayList<>();
118 | for (String pos : poses) {
119 | positions.add(Integer.parseInt(pos.trim()));
120 | }
121 | Posting post = new Posting(docID, positions);
122 | postingsList.add(post);
123 | }
124 | if (DEBUG) {
125 | System.err.println("Loaded postings list: " + postingsList);
126 | }
127 | return postingsList.iterator();
128 | }
129 |
130 |
131 | public static void main(String[] args) {
132 | if (args.length == 0) {
133 | for (String[] test : intersectTestCases) {
134 | Iterator<Posting> pl1 = loadPostingsList(test[0]);
135 | Iterator<Posting> pl2 = loadPostingsList(test[1]);
136 | System.out.println("Intersection of " + test[0]);
137 | System.out.println(" and " + test[1] + ": ");
138 | List<AnswerElement> ans = positionalIntersect(pl1, pl2, 5);
139 | System.out.println("Answer: " + ans);
140 | if ( ! ans.toString().equals(test[2])) {
141 | System.out.println("Should be: " + test[2]);
142 | System.out.println("*** ERROR ***");
143 | }
144 | System.out.println();
145 | }
146 | } else if (args.length != 2) {
147 | System.err.println("Usage: java PositionalWithinK postingsList1 postingsList2");
148 | System.err.println(" postingsList format: '1:17,25; 4:17,191,291,430,434; 5:14,19,101'");
149 | } else {
150 | Iterator<Posting> pl1 = loadPostingsList(args[0]);
151 | Iterator<Posting> pl2 = loadPostingsList(args[1]);
152 | List<AnswerElement> ans = positionalIntersect(pl1, pl2, 5);
153 | System.out.println(ans);
154 | }
155 | }
156 |
157 | }
158 |
--------------------------------------------------------------------------------
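
The `positionalIntersect` stub in `PositionalWithinK.java` is left as an exercise. As a rough illustration only, here is one way the within-k merge might be sketched. It is not the course's reference solution. The class name `PositionalWithinKSketch`, the simplified `Posting` constructor, and the use of plain strings for answers are all assumptions made to keep the example self-contained. It assumes both postings lists are sorted by docID and that positions within each posting are ascending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PositionalWithinKSketch {

  /** Hypothetical stand-in for the exercise's Posting class. */
  static class Posting {
    final int docID;
    final List<Integer> positions;
    Posting(int docID, Integer... positions) {
      this.docID = docID;
      this.positions = Arrays.asList(positions);
    }
  }

  /** Returns the next item from the Iterator, or null if it is exhausted. */
  static <X> X popNextOrNull(Iterator<X> p) {
    return p.hasNext() ? p.next() : null;
  }

  /** Returns "(docID,pos1,pos2)" strings for every pair with |pos1 - pos2| <= k. */
  static List<String> positionalIntersect(Iterator<Posting> p1, Iterator<Posting> p2, int k) {
    List<String> answer = new ArrayList<>();
    Posting a = popNextOrNull(p1);
    Posting b = popNextOrNull(p2);
    while (a != null && b != null) {
      if (a.docID == b.docID) {
        List<Integer> pp2 = b.positions;
        // Index of the first position in pp2 that is not too far left of pos1.
        // It only moves forward because the positions in a are ascending.
        int start = 0;
        for (int pos1 : a.positions) {
          while (start < pp2.size() && pp2.get(start) < pos1 - k) {
            start++;
          }
          // Emit every pp2 position in the window [pos1 - k, pos1 + k].
          for (int j = start; j < pp2.size() && pp2.get(j) <= pos1 + k; j++) {
            answer.add("(" + a.docID + "," + pos1 + "," + pp2.get(j) + ")");
          }
        }
        a = popNextOrNull(p1);
        b = popNextOrNull(p2);
      } else if (a.docID < b.docID) {
        a = popNextOrNull(p1);
      } else {
        b = popNextOrNull(p2);
      }
    }
    return answer;
  }

  public static void main(String[] args) {
    // Part of the fourth test case from PositionalWithinK.java, with k = 5.
    List<Posting> l1 = Arrays.asList(new Posting(17, 6, 16), new Posting(21, 103, 113, 114));
    List<Posting> l2 = Arrays.asList(new Posting(5, 2), new Posting(17, 11), new Posting(21, 3, 97, 108));
    System.out.println(positionalIntersect(l1.iterator(), l2.iterator(), 5));
  }
}
```

The outer loop is the same docID merge used for a plain intersection. The inner sliding window avoids rescanning the second position list from the start for every position in the first, at the cost of assuming both position lists are sorted.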