├── .gitignore
├── README.rst
├── fileformat.rst
├── pom-cdh4.xml
├── pom-hadoop-0.21.xml
├── pom-hadoop-0.23.xml
├── pom-hadoop-2.0.xml
├── pom.xml
├── python
│   ├── README.md
│   ├── diff_match_patch.py
│   ├── example.py
│   ├── page_sample.xml
│   ├── revision_differ.py
│   └── xml_simulator.py
└── src
    ├── main
    │   └── java
    │       └── org
    │           └── wikimedia
    │               └── wikihadoop
    │                   ├── ByteMatcher.java
    │                   ├── SeekableInputStream.java
    │                   └── StreamWikiDumpInputFormat.java
    └── test
        └── java
            └── org
                └── wikimedia
                    └── wikihadoop
                        └── TestStreamWikiDumpInputFormat.java
/.gitignore:
--------------------------------------------------------------------------------
1 | target/
2 | mapred
3 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 |
2 | =====================
3 | WikiHadoop
4 | =====================
5 | --------------------------------------------------------------------------------------------
6 | Stream-based InputFormat for processing the compressed XML dumps of Wikipedia with Hadoop
7 | --------------------------------------------------------------------------------------------
8 |
9 | :Homepage: http://github.com/whym/wikihadoop
10 | :Date: 2012-04-16
11 | :Version: 0.2
12 |
13 | Overview
14 | ==============================
15 |
16 | Wikipedia XML dumps with complete edit histories [#]_ have been
17 | difficult to process because of their exceptional size and structure.
18 | While a "page" is a common processing unit, one Wikipedia page may
19 | contain gigabytes of text when the edit history is very
20 | long.
21 |
22 | This software provides an ``InputFormat`` for `Hadoop Streaming`_
23 | Interface that processes Wikipedia bzip2 XML dumps in a streaming
24 | manner. Using this ``InputFormat``, the content of every page is fed
25 | to a mapper via standard input and output without using too much
26 | memory. Thanks to Hadoop Streaming, mappers can be implemented in any
27 | language.
28 |
29 | See the `wiki page`__ for a more detailed introduction and tutorial.
30 |
31 | __ https://github.com/whym/wikihadoop/wiki
32 | .. _Hadoop Streaming: http://hadoop.apache.org/common/docs/current/streaming.html
33 | .. _Apache Hadoop: http://hadoop.apache.org
34 | .. _Apache Maven: http://maven.apache.org
35 | .. _WikiHadoop: http://github.com/whym/wikihadoop
36 |
37 | .. [#] For example, one dump file such as pages-meta-history1.xml.bz2,
38 | pages-meta-history6.xml.bz2, etc, provided at
39 | http://dumps.wikimedia.org/enwiki/20110803/ is more than 30
40 | gigabytes in compressed forms, and more than 700 gigabytes
41 | when decompressed.
42 |
43 | How to use
44 | ==============================
45 | Essentially, WikiHadoop is an input format for ``Hadoop Streaming``. Once ``StreamWikiDumpInputFormat`` is on the class path, you can pass it to the ``-inputformat`` option.
46 |
47 | To get the input format class working with Hadoop Streaming, follow these steps:
48 |
49 | 1. Install `Apache Hadoop`_. Versions 0.21, 0.22 and 0.23 are the ones we tested.
50 |
51 | - See also Requirements_.
52 |
53 | 2. Obtain our jar file from the `download page`_. Alternatively, you can build the class and/or the jar yourself (see `How to build`_). We will call the jar file ``wikihadoop.jar`` in this document.
54 |
55 | 3. Find the jar file of Hadoop Streaming in your copy of Hadoop. It is probably found at ``mapred/contrib/streaming/hadoop-*-streaming.jar``. We will call it ``hadoop-streaming.jar`` in this document.
56 |
57 | 4. Run a `Hadoop Streaming`_ command with the jar file and our input format specified.
58 |
59 | - A command will look like this: ::
60 |
61 | hadoop jar hadoop-streaming.jar -libjars wikihadoop.jar -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat -mapper /bin/cat
62 |
63 | See `Sample command line usage`_, `Configuration variables`_ and the official documentation of `Hadoop Streaming`_ for more details.
64 |
65 | We recommend using our differ_ as the mapper when creating text
66 | diffs between consecutive revisions. The differ
67 | ``revision_differ.py`` is included in the tarball under ``diffs``, or
68 | can be downloaded from the MediaWiki SVN repository by ``svn
69 | checkout
70 | http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs``.
71 | See its `Differ's readme file`_ for more details and other requirements.
72 |
73 | Note: mappers need to be distributed to the computing nodes under
74 | the same path. To do so, you can use the ``-file`` option of Hadoop
75 | Streaming or copy the necessary files manually.
76 |
77 | .. _Differ's readme file: http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs/README.txt
78 | .. _StreamWikiDumpInputFormat: https://github.com/whym/wikihadoop/blob/master/mapreduce/src/contrib/streaming/src/java/org/wikimedia/wikihadoop/StreamWikiDumpInputFormat.java
79 | .. _download page: http://whym.github.io/wikihadoop/
80 |
81 | How to build
82 | ==============================
83 |
84 | 1. Download WikiHadoop_ and extract the source tree.
85 |
86 | We provide both our git repository and a tarball package.
87 |
88 | - Use ``git clone https://whym@github.com/whym/wikihadoop.git`` to
89 | access the latest source.
90 |
91 | 2. Run Maven to build a jar file. ::
92 |
93 | mvn package
94 |
95 | - By default, it compiles against the Hadoop 0.22 code base. We have found that the resulting jar file is compatible with Hadoop 0.21, 0.23, 2.0 and CDH4. If it is incompatible for some reason, try building it with one of the customized pom files by running a command like ``mvn -f pom-hadoop-0.21.xml package`` or ``mvn -f pom-hadoop-0.23.xml package``, or change the dependencies manually.
96 |
97 | 3. Find the resulting jar file at ``target/wikihadoop-*.jar``.
98 |
99 |
100 | Input & Output format
101 | =============================
102 |
103 | Input can be Wikipedia XML dumps, either compressed with bzip2 (this
104 | is what you can directly get from the distribution site) or
105 | uncompressed.
106 |
107 | The record reader embedded in this input format converts a page into a
108 | sequence of page-like elements, each of which contains two consecutive
109 | revisions. Output is given as key-value style records where a key is a
110 | page-like element and a value is always empty. For example, given the
111 | following input containing two pages and four revisions, ::
112 |
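113 |   <page>
114 |     <title>ABC</title>
115 |     <id>123</id>
116 |     <revision>
117 |       <id>100</id>
118 |       ....
119 |     </revision>
120 |     <revision>
121 |       <id>200</id>
122 |       ....
123 |     </revision>
124 |     <revision>
125 |       <id>300</id>
126 |       ....
127 |     </revision>
128 |   </page>
129 |   <page>
130 |     <title>DEF</title>
131 |     <id>456</id>
132 |     <revision>
133 |       <id>400</id>
134 |       ....
135 |     </revision>
136 |   </page>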
137 |
138 | it will produce four keys formatted in page-like elements as follows ::
139 |
140 |   <page>
141 |     <title>ABC</title>
142 |     <id>123</id>
143 |     <revision>
144 |       <id>100</id>
145 |       ....
146 |     </revision>
147 |   </page>
148 |
149 | ::
150 |
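151 |   <page>
152 |     <title>ABC</title>
153 |     <id>123</id>
154 |     <revision>
155 |       <id>100</id>
156 |       ....
157 |     </revision>
158 |     <revision>
159 |       <id>200</id>
160 |       ....
161 |     </revision>
162 |   </page>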
163 |
164 | ::
165 |
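166 |   <page>
167 |     <title>ABC</title>
168 |     <id>123</id>
169 |     <revision>
170 |       <id>200</id>
171 |       ....
172 |     </revision>
173 |     <revision>
174 |       <id>300</id>
175 |       ....
176 |     </revision>
177 |   </page>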
178 |
179 | ::
180 |
181 |   <page>
182 |     <title>DEF</title>
183 |     <id>456</id>
184 |     <revision>
185 |       <id>400</id>
186 |       ....
187 |     </revision>
188 |   </page>
189 |
190 | This result provides a mapper with all the information about each revision, including the title and page ID. We recommend using our differ_ to get diffs.
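
The pairing of consecutive revisions described above can be sketched in a few lines of Python. This is only an illustration of the record layout, not the Java record reader itself: the function name ``revision_pairs`` and its ``previous_revision`` flag are hypothetical, and revisions are represented by their IDs rather than by XML.

```python
def revision_pairs(revisions, previous_revision=True):
    """Yield (previous, current) tuples for the revisions of one page.

    The first revision of a page has no predecessor, so it is paired
    with None. With previous_revision=False, only the current revision
    is reported (the previous slot is always None).
    """
    previous = None
    for current in revisions:
        yield (previous if previous_revision else None, current)
        previous = current


if __name__ == "__main__":
    # Two pages and four revisions, as in the example above.
    pages = [("ABC", [100, 200, 300]), ("DEF", [400])]
    for title, revision_ids in pages:
        for prev, cur in revision_pairs(revision_ids):
            print(title, prev, cur)
```

Each yielded tuple corresponds to one page-like element: four tuples for the four revisions in the example.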
191 |
192 | .. _differ: http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs/
193 |
194 | Requirements
195 | ==============================
196 | The following software is required.
197 |
198 | - `Apache Hadoop`_
199 |
200 | - Versions 0.21, 0.22, 0.23, 2.0 and CDH4 are supported.
201 | - `Cloudera's`_ cdh3u1 is also supported at the `cdh3u1 branch`_ (thanks to François Kawla).
202 |
203 | - `Apache Maven`_
204 |
205 | - Version 2 or 3 (the default version we test against is 2.2.1).
206 |
207 | See also `Supported Versions of Hadoop`_ for more information.
208 |
209 |
210 | .. _Cloudera's: https://ccp.cloudera.com/display/SUPPORT/Downloads
211 | .. _cdh3u1 branch: https://github.com/whym/wikihadoop/tree/cdh3u1
212 | .. _Supported Versions of Hadoop: https://github.com/whym/wikihadoop/wiki/Supported-Versions-of-Hadoop.
213 |
214 | Sample command line usage
215 | ==============================
216 |
217 | - To process an English Wikipedia dump with the cat command: ::
218 |
219 | hadoop jar hadoop-streaming.jar -libjars wikihadoop.jar -D mapreduce.input.fileinputformat.split.minsize=300000000 -D mapreduce.task.timeout=6000000 -input /enwiki-20110722-pages-meta-history27.xml.bz2 -output /usr/hadoop/out -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat -mapper /bin/cat
220 |
221 | Configuration variables
222 | ==============================
223 | The following parameters can be configured in the same way as the other parameters described in `Hadoop Streaming`_.
224 |
225 | ``org.wikimedia.wikihadoop.excludePagesWith=REGEX``
226 | Used to exclude pages whose headers match this regular expression.
227 | For example, to exclude all namespaces except for the main article space, use ``-D org.wikimedia.wikihadoop.excludePagesWith="<title>(Media|Special|Talk|User|User talk|Wikipedia|Wikipedia talk|File|File talk|MediaWiki|MediaWiki talk|Template|Template talk|Help|Help talk|Category|Category talk|Portal|Portal talk|Book|Book talk):"``.
228 | When unspecified, WikiHadoop sends all pages to mappers.
229 |
230 | Ignoring pages irrelevant to the task is a good idea, if you want to speed up the process.
231 |
232 | ``org.wikimedia.wikihadoop.previousRevision=true or false``
233 | When set to ``false``, WikiHadoop writes only one revision in each page-like element, without attaching the previous revision.
234 | The default behaviour (``true``) is to write two consecutive revisions in one page-like element.
235 |
236 | ``mapreduce.input.fileinputformat.split.minsize=BYTES``
237 | This variable specifies the minimum size of a split sent to
238 | input readers.
239 |
240 | The default size tends to be too small. Try setting it to a
241 | larger value. The optimal value seems to be around
242 | (size of the input dump file) / (number of processors) / 5.
243 | For example, it will be 500000000 for English Wikipedia dumps
244 | when processing with 12 processors.
245 |
246 | ``mapreduce.task.timeout=MSECS``
247 | A timeout may happen when pages are too long. Try setting this to a
248 | value larger than 6000000. Before it starts
249 | parsing the data and reporting the progress, WikiHadoop can take
250 | more than 6000 seconds to preprocess XML dumps.
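
The rule of thumb given for ``mapreduce.input.fileinputformat.split.minsize`` can be worked through as follows. This is a minimal sketch: the helper name ``suggested_min_split_size`` is hypothetical, and the 30 GB dump size is an assumption taken from the footnote in the overview.

```python
def suggested_min_split_size(dump_size_bytes, num_processors, factor=5):
    """Return (dump size) / (number of processors) / factor in whole bytes."""
    if num_processors <= 0 or factor <= 0:
        raise ValueError("num_processors and factor must be positive")
    return dump_size_bytes // (num_processors * factor)


if __name__ == "__main__":
    # Assumed input: a 30 GB compressed dump processed with 12 processors.
    size = suggested_min_split_size(30 * 10**9, 12)
    print("-D mapreduce.input.fileinputformat.split.minsize=%d" % size)
```

With these assumed inputs, the result is 500000000, matching the figure quoted for English Wikipedia dumps.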
251 |
252 | Mechanism
253 | ==============================
254 |
255 | Splitting
256 | ----------------
257 | Input dump files are divided into splits whose sizes are close to
258 | the value of ``mapreduce.input.fileinputformat.split.minsize``. When
259 | non-compressed input is used, each split ends exactly at the end of a page.
260 | When bzip2 (or other splittable compression) input is used, each split
261 | is modified so that every page is contained in at least one of the
262 | splits.
263 |
264 | Parsing
265 | ----------------
266 |
267 | WikiHadoop's parser can be seen as a SAX parser that is tuned for
268 | Wikipedia XML dumps. By limiting its flexibility, it aims to
269 | achieve higher efficiency. Instead of extracting all occurrences of
270 | elements and attributes, it only looks for beginnings and endings of
271 | ``page`` elements and ``revision`` elements.
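
The idea of looking only for element beginnings and endings can be illustrated with plain substring search. The sketch below is not WikiHadoop's Java parser; it is a simplified Python illustration that assumes the opening tags carry no attributes and that the whole text is already in memory. The name ``find_elements`` is hypothetical.

```python
def find_elements(text, name, start=0):
    """Yield (begin, end) offsets of each <name>...</name> element in text.

    Only the literal opening and closing tags are searched for; nothing
    between them is parsed. Assumes the opening tag has no attributes
    and that elements of this name are not nested.
    """
    open_tag = "<" + name + ">"
    close_tag = "</" + name + ">"
    pos = start
    while True:
        begin = text.find(open_tag, pos)
        if begin == -1:
            return
        end = text.find(close_tag, begin)
        if end == -1:
            return  # unterminated element: stop rather than guess
        end += len(close_tag)
        yield begin, end
        pos = end


if __name__ == "__main__":
    sample = ("<page><title>ABC</title>"
              "<revision><id>100</id></revision>"
              "<revision><id>200</id></revision></page>")
    for begin, end in find_elements(sample, "revision"):
        print(sample[begin:end])
```

Skipping everything between the tags is what keeps this kind of scan cheap compared with a general-purpose XML parser.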
272 |
273 | Known problems
274 | ==============================
275 | - Hadoop map tasks with ``StreamWikiDumpInputFormat`` may take a long
276 | time to finish preprocessing before they start reporting progress.
277 | - Some revision pairs may be emitted twice when bzip2 input is
278 | used. (`Issue #1`_)
279 |
280 | .. _Issue #1: https://github.com/whym/wikihadoop/issues/1
281 |
282 | .. Local variables:
283 | .. mode: rst
284 | .. End:
285 |
--------------------------------------------------------------------------------
/fileformat.rst:
--------------------------------------------------------------------------------
1 | ==Location==
2 | The diffdb can be downloaded from [http://dumps.wikimedia.org/other/diffdb/ dumps.wikimedia.org].
3 |
4 | ==Fields==
5 |
6 | hadoop21@beta:~/wikihadoop/diffs$ /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-10-bzip2/part-00000 | head -n 3
7 | 133350337 11406585 0 'National security and homeland security presidential directive' 1180070193 u'Begin' False 308437 u'Badagnani' 0:1:u"The '''[[National Security and Homeland Security Presidential Directive]]''' (NSPD-51/HSPD-20), signed by President [[George W. Bush]] on May 9, 2007, is a [[Presidential Directive]] giving the [[President of the United States]] near-total control over the United States in the event of a catastrophic event, without the oversight of [[United States Congress|Congress]].\n\nThe signing of this Directive was generally unnoticed by the U.S. media as well as the U.S. Congress. It is unclear how the National Security and Homeland Security Presidential Directive will reconcile with the [[National Emergencies Act]], signed in 1976, which gives Congress oversight during such emergencies.\n\n==External links==\n*[http://www.whitehouse.gov/news/releases/2007/05/20070509-12.html National Security and Homeland Security Presidential Directive], from White House site\n\n==See also==\n*[[National Emergencies Act]]\n*[[George W. Bush]]\n\n{{US-stub}}"
8 | 133350707 11406585 0 'National security and homeland security presidential directive' 1180070344 None False 308437 u'Badagnani' 906:1:u'National Security Directive]]\n*[['
9 | 133350794 11406585 0 'National security and homeland security presidential directive' 1180070386 None False 308437 u'Badagnani' 613:-1:u'signed' 613:1:u'a U.S. federal law passed'
10 |
11 |
12 | Each row represents a revision from an XML dump of the English Wikipedia. There *should* be a row for every revision that wasn't deleted when that dump was produced; however, at this time, some cleanup will need to be done to remove duplicates and fill in missing revision diffs.
13 | * rev_id: The identifier of the revision being described PRIMARY KEY
14 | * page_id: The identifier of the page being revised
15 | * namespace: The identifier of the namespace of the page
16 | * title: The title of the page being revised
17 | * timestamp: The time the revision took place as a Unix epoch timestamp in seconds
18 | * comment: The edit summary left by the editor
19 | * minor: Minor status of the edit (boolean)
20 | * user_id: The identifier of the editor who saved the revision
21 | * user_text: The username of the editor who saved the revision
22 | * diffs: Tab-separated diff operations. Each diff operation has three parts (separated by colons):
23 | ** position: The position in the article text at which the operation took place
24 | ** action: Did the operation add or remove some text? ("1" for add, "-1" for remove)
25 | ** content: The text operated on. For added text, this is the content to add. For removed text, this is the content that was removed.
26 |
27 | Each row can have zero or more diff operations. Values in the result set have been encoded using Python's repr() function and can be decoded in Python with the eval() function.
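
As an illustration of the field layout above, the following Python 3 sketch decodes one row. It assumes that all fields are separated by tabs, and it uses `ast.literal_eval` in place of `eval()`, since it accepts only literals and is therefore safer on downloaded data. The names `FIELDS` and `parse_row` are hypothetical.

```python
import ast

# Column order as listed in the field descriptions above.
FIELDS = ("rev_id", "page_id", "namespace", "title", "timestamp",
          "comment", "minor", "user_id", "user_text")


def parse_row(line):
    """Decode one tab-separated diffdb row into a dict."""
    columns = line.rstrip("\n").split("\t")
    row = {name: ast.literal_eval(value)
           for name, value in zip(FIELDS, columns)}
    row["diffs"] = []
    for operation in columns[len(FIELDS):]:
        # The content may itself contain colons, so split at most twice.
        position, action, content = operation.split(":", 2)
        row["diffs"].append(
            (int(position), int(action), ast.literal_eval(content)))
    return row


if __name__ == "__main__":
    sample = ("133350794\t11406585\t0\t'Example title'\t1180070386\tNone\t"
              "False\t308437\tu'Badagnani'\t613:-1:u'signed'\t"
              "613:1:u'a U.S. federal law passed'\n")
    print(parse_row(sample))
```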
28 |
29 | ==Reproduction==
30 | # Install [http://hadoop.apache.org Hadoop], [https://github.com/whym/wikihadoop WikiHadoop] and the [http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs/ differ].
31 |
32 | # Log in to the Hadoop master node.
33 | # Download the Wikipedia dump files compressed in bz2 from [http://dumps.wikimedia.org/enwiki/ the dump distribution site]. Make sure to choose the dumps with full edit histories (pages-meta-historyN.xml.bz2).
34 | #* For the 20110405 dumps (this is the source of the dataset being generated): [http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history1.xml.bz2] [http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history2.xml.bz2] [http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history3.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history4.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history5.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history6.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history7.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history8.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history9.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history10.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history11.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history12.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history13.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history14.xml.bz2][http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-meta-history15.xml.bz2]
35 | # Copy the dump files into HDFS using /usr/lib/hadoop-beta/bin/hdfs dfs -copyFromLocal enwiki*.xml
36 | # Launch a Hadoop job for each dump file using the command below.
37 | #*
38 | #* With 3 nodes and 24 cores in total, one dump file of EN wiki takes approximately 20-24 hours to process.
39 | # If you want to extract the dataset as an ordinary file, accumulate the dataset rows into one file (diffs.tsv) using /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* > diffs.tsv.
40 | #* There are some duplicates in the results [https://github.com/whym/wikihadoop/issues/1]. If you want to exclude those duplicates, use /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* | sort -n -k2 -k1 -u -T ~/tmp/ > diffs.tsv instead. Note that ~/tmp needs to be a directory large enough to contain all the results shown with /usr/lib/hadoop-beta/bin/hdfs dfs -du /usr/hadoop/out-*/part-*.
41 | #* This may take from several hours to one day depending on the size. The results will be more than 400 GB for EN wiki.
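
As an alternative illustration of the duplicate removal step, the Python 3 sketch below drops repeated rows from input that has already been sorted so that rows with the same rev_id are adjacent. It assumes tab-separated rows with rev_id in the first column; the function names are hypothetical.

```python
from itertools import groupby


def rev_id(line):
    """Return the first tab-separated column of a row (the rev_id)."""
    return line.split("\t", 1)[0]


def drop_adjacent_duplicates(lines):
    """Yield the first row of every run of rows sharing a rev_id.

    Rows are compared only with their neighbours, so memory use stays
    constant; the input must already be sorted by rev_id.
    """
    for _, group in groupby(lines, key=rev_id):
        yield next(group)


if __name__ == "__main__":
    rows = ["1\ta\n", "1\ta\n", "2\tb\n", "3\tc\n", "3\tc\n"]
    for row in drop_adjacent_duplicates(rows):
        print(row, end="")
```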
42 |
43 | ==Notes==
44 | The dataset being generated is incomplete in two ways.
45 | * Missing entries for less than 0.003% of revisions (estimated). [https://github.com/whym/wikihadoop/issues/2]
46 | * Duplicated entries for less than 0.02% of revisions (estimated). [https://github.com/whym/wikihadoop/issues/1]
47 |
--------------------------------------------------------------------------------
/pom-cdh4.xml:
--------------------------------------------------------------------------------
1 |
2 |
4 | 4.0.0
5 |
6 | org.wikimedia
7 | wikihadoop
8 | 0.2-CDH4
9 | jar
10 | wikihadoop
11 | http://github.com/whym/wikihadoop
12 |
13 |
14 | UTF-8
15 |
16 |
17 |
18 |
19 | cloudera-2
20 | Cloudera Repository
21 | https://repository.cloudera.com/artifactory/cloudera-repos
22 |
23 |
24 |
25 |
26 |
27 | org.apache.hadoop
28 | hadoop-common
29 | 2.0.0-cdh4.1.2
30 |
31 |
32 | org.apache.hadoop
33 | hadoop-mapreduce
34 | 2.0.0-cdh4.1.2
35 | pom
36 |
37 |
38 | org.apache.hadoop
39 | hadoop-client
40 | 2.0.0-cdh4.1.2
41 |
42 |
43 | log4j
44 | log4j
45 | 1.2.13
46 | test
47 |
48 |
49 | junit
50 | junit
51 | 4.10
52 | test
53 |
54 |
55 | commons-logging
56 | commons-logging
57 | 1.1.1
58 |
59 |
60 |
61 |
62 |
63 |
64 | maven-javadoc-plugin
65 | 2.8.1
66 |
67 | true
68 | true
69 |
70 |
71 |
72 | maven-jar-plugin
73 | 2.4
74 |
75 |
76 |
77 | org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
78 |
79 |
80 |
81 |
82 |
83 | org.apache.maven.plugins
84 | maven-compiler-plugin
85 | 2.4
86 |
87 | 1.5
88 | 1.5
89 |
90 |
91 |
92 |
93 |
94 | ${project.build.directory}/dependency
95 |
96 |
97 |
98 |
99 |
--------------------------------------------------------------------------------
/pom-hadoop-0.21.xml:
--------------------------------------------------------------------------------
1 |
2 |
4 | 4.0.0
5 |
6 | org.wikimedia
7 | wikihadoop
8 | 0.2-AH_0_21
9 | jar
10 | wikihadoop
11 | http://github.com/whym/wikihadoop
12 |
13 |
14 |
15 | apache-public
16 | https://repository.apache.org/content/groups/public/
17 |
18 | true
19 |
20 |
21 | true
22 |
23 |
24 |
25 |
26 |
27 | UTF-8
28 |
29 |
30 |
31 |
32 | org.apache.hadoop
33 | hadoop-common
34 | 0.21.0-SNAPSHOT
35 |
36 |
37 | org.apache.hadoop
38 | hadoop-mapred
39 | 0.21.0-SNAPSHOT
40 |
41 |
42 | log4j
43 | log4j
44 | 1.2.13
45 | test
46 |
47 |
48 | junit
49 | junit
50 | 4.10
51 | test
52 |
53 |
54 | commons-logging
55 | commons-logging
56 | 1.1.1
57 |
58 |
59 |
60 |
61 |
62 |
63 | maven-javadoc-plugin
64 | 2.8.1
65 |
66 | true
67 | true
68 |
69 |
70 |
71 | maven-jar-plugin
72 | 2.4
73 |
74 |
75 |
76 | org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
77 |
78 |
79 |
80 |
81 |
82 | org.apache.maven.plugins
83 | maven-compiler-plugin
84 | 2.4
85 |
86 | 1.5
87 | 1.5
88 |
89 |
90 |
91 |
92 |
93 | ${project.build.directory}/dependency
94 |
95 |
96 |
97 |
98 |
--------------------------------------------------------------------------------
/pom-hadoop-0.23.xml:
--------------------------------------------------------------------------------
1 |
2 |
4 | 4.0.0
5 |
6 | org.wikimedia
7 | wikihadoop
8 | 0.2-AH_0_23
9 | jar
10 | wikihadoop
11 | http://github.com/whym/wikihadoop
12 |
13 |
14 |
15 | apache-public
16 | https://repository.apache.org/content/groups/public/
17 |
18 | true
19 |
20 |
21 | true
22 |
23 |
24 |
25 |
26 |
27 | UTF-8
28 |
29 |
30 |
31 |
32 | org.apache.hadoop
33 | hadoop-common
34 | 0.23.3-SNAPSHOT
35 | jar
36 |
37 |
38 | org.apache.hadoop
39 | hadoop-mapreduce
40 | 0.23.3-SNAPSHOT
41 | pom
42 |
43 |
44 | org.apache.hadoop
45 | hadoop-client
46 | 0.23.3-SNAPSHOT
47 | pom
48 |
49 |
50 | log4j
51 | log4j
52 | 1.2.13
53 | test
54 |
55 |
56 | junit
57 | junit
58 | 4.10
59 | test
60 |
61 |
62 | commons-logging
63 | commons-logging
64 | 1.1.1
65 |
66 |
67 |
68 |
69 |
70 |
71 | maven-javadoc-plugin
72 | 2.8.1
73 |
74 | true
75 | true
76 |
77 |
78 |
79 | maven-jar-plugin
80 | 2.4
81 |
82 |
83 |
84 | org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
85 |
86 |
87 |
88 |
89 |
90 | org.apache.maven.plugins
91 | maven-compiler-plugin
92 | 2.4
93 |
94 | 1.5
95 | 1.5
96 |
97 |
98 |
99 |
100 |
101 | ${project.build.directory}/dependency
102 |
103 |
104 |
105 |
106 |
--------------------------------------------------------------------------------
/pom-hadoop-2.0.xml:
--------------------------------------------------------------------------------
1 |
2 |
4 | 4.0.0
5 |
6 | org.wikimedia
7 | wikihadoop
8 | 0.2-AH_2_0_0
9 | jar
10 | wikihadoop
11 | http://github.com/whym/wikihadoop
12 |
13 |
14 |
15 | apache-public
16 | https://repository.apache.org/content/groups/public/
17 |
18 | true
19 |
20 |
21 | true
22 |
23 |
24 |
25 |
26 |
27 | UTF-8
28 |
29 |
30 |
31 |
32 | org.apache.hadoop
33 | hadoop-common
34 | 2.0.0-alpha
35 | jar
36 |
37 |
38 | org.apache.hadoop
39 | hadoop-mapreduce
40 | 2.0.0-alpha
41 | pom
42 |
43 |
44 | org.apache.hadoop
45 | hadoop-client
46 | 2.0.0-alpha
47 | pom
48 |
49 |
50 | log4j
51 | log4j
52 | 1.2.13
53 | test
54 |
55 |
56 | junit
57 | junit
58 | 4.10
59 | test
60 |
61 |
62 | commons-logging
63 | commons-logging
64 | 1.1.1
65 |
66 |
67 |
68 |
69 |
70 |
71 | maven-javadoc-plugin
72 | 2.8.1
73 |
74 | true
75 | true
76 |
77 |
78 |
79 | maven-jar-plugin
80 | 2.4
81 |
82 |
83 |
84 | org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
85 |
86 |
87 |
88 |
89 |
90 | org.apache.maven.plugins
91 | maven-compiler-plugin
92 | 2.4
93 |
94 | 1.5
95 | 1.5
96 |
97 |
98 |
99 |
100 |
101 | ${project.build.directory}/dependency
102 |
103 |
104 |
105 |
106 |
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
1 |
2 |
4 | 4.0.0
5 |
6 | org.wikimedia
7 | wikihadoop
8 | 0.2
9 | jar
10 | wikihadoop
11 | http://github.com/whym/wikihadoop
12 |
13 |
14 |
15 | apache-public
16 | https://repository.apache.org/content/groups/public/
17 |
18 | true
19 |
20 |
21 | true
22 |
23 |
24 |
25 |
26 |
27 | UTF-8
28 |
29 |
30 |
31 |
32 | org.apache.hadoop
33 | hadoop-common
34 | 0.22.0
35 |
36 |
37 | org.apache.hadoop
38 | hadoop-mapred
39 | 0.22.0
40 |
41 |
42 | log4j
43 | log4j
44 | 1.2.13
45 | test
46 |
47 |
48 | junit
49 | junit
50 | 4.10
51 | test
52 |
53 |
54 | commons-logging
55 | commons-logging
56 | 1.1.1
57 |
58 |
59 |
60 |
61 |
62 |
63 | maven-javadoc-plugin
64 | 2.8.1
65 |
66 | true
67 | true
68 |
69 |
70 |
71 | maven-jar-plugin
72 | 2.4
73 |
74 |
75 |
76 | org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
77 |
78 |
79 |
80 |
81 |
82 | org.apache.maven.plugins
83 | maven-compiler-plugin
84 | 2.4
85 |
86 | 1.5
87 | 1.5
88 |
89 |
90 |
91 |
92 |
93 | ${project.build.directory}/dependency
94 |
95 |
96 |
97 |
98 |
--------------------------------------------------------------------------------
/python/README.md:
--------------------------------------------------------------------------------
1 | Revision Differ
2 |
3 | This script was written to be a streaming mapper for wikihadoop
4 | (see https://github.com/whym/wikihadoop). By default, this script runs under
5 | pypy (much faster), but it can also be run under CPython 2.7+.
6 |
7 | Required to run this script:
8 | - revision_differ.py (provided)
9 | - diff_match_patch.py (provided)
10 | - xml_simulator.py (provided)
11 | - wikimedia-utilities (https://bitbucket.org/halfak/wikimedia-utilities)
12 |
13 | Author: Aaron Halfaker (aaron.halfaker@gmail.com)
14 |
15 | This software is licensed under GPLv2 (http://www.gnu.org/licenses/gpl-2.0.html) and
16 | is provided WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
17 | implied.
18 |
--------------------------------------------------------------------------------
/python/example.py:
--------------------------------------------------------------------------------
1 | # Written for Python 2: relies on the StringIO module and unichr().
2 | from StringIO import StringIO
3 | from diff_match_patch import diff_match_patch
4 | import re
5 | 
6 | revs = [
7 |     {'rev_id': 1, 'content':'Foo derp 263254'},
8 |     {'rev_id': 2, 'content':'Foo derp 26354'}
9 | ]
10 | 
11 | def tokenize(content):
12 |     # The regex engine takes the first alternative that matches, so the
13 |     # three-brace patterns must come before the two-brace ones.
14 |     return re.findall(
15 |         r"[\w]+" + #Word
16 |         r"|\[\[" + #Opening internal link
17 |         r"|\]\]" + #Closing internal link
18 |         r"|\{\{\{" + #Opening template var
19 |         r"|\}\}\}" + #Closing template var
20 |         r"|\{\{" + #Opening template
21 |         r"|\}\}" + #Closing template
22 |         r"|\n+" + #Line breaks
23 |         r"| +" + #Spaces
24 |         r"|&\w+;" + #HTML escape sequence
25 |         r"|'''" + #Bold
26 |         r"|''" + #Italics
27 |         r"|=+" + #Header
28 |         r"|\{\|" + #Opening table
29 |         r"|\|\}" + #Closing table
30 |         r"|\|\-" + #Table row
31 |         r"|.", #Misc character
32 |         content
33 |     )
34 | 
35 | def hashTokens(tokens, hash2Token=None, token2Hash=None):
36 |     # Map every distinct token to one unicode character so that the
37 |     # character-based differ can operate on whole tokens.
38 |     # Fresh tables are created per call unless the caller passes shared
39 |     # ones; mutable default arguments would persist between calls.
40 |     if hash2Token is None: hash2Token = []
41 |     if token2Hash is None: token2Hash = {}
42 |     hashBuffer = StringIO()
43 |     for t in tokens:
44 |         if t in token2Hash:
45 |             hashBuffer.write(unichr(token2Hash[t]+1))
46 |         else:
47 |             hashId = len(hash2Token)
48 |             hash2Token.append(t)
49 |             token2Hash[t] = hashId
50 |             hashBuffer.write(unichr(hashId+1))
51 | 
52 |     return (hashBuffer.getvalue(), hash2Token, token2Hash)
53 | 
54 | def unhash(hashes, hash2Token, sep=''):
55 |     # Inverse of hashTokens: turn hash characters back into tokens.
56 |     return sep.join(hash2Token[ord(h)-1] for h in hashes)
57 | 
58 | def simpleDiff(content1, content2, tokenize=tokenize, sep='', report=[-1,0,1]):
59 |     # Yield (position, action, content) for each diff operation whose
60 |     # action (-1 remove, 0 equal, 1 add) is listed in report.
61 |     hashes1, h2t, t2h = hashTokens(tokenize(content1))
62 |     hashes2, h2t, t2h = hashTokens(tokenize(content2), h2t, t2h)
63 | 
64 |     report = set(report)
65 | 
66 |     dmp = diff_match_patch()
67 | 
68 |     diffs = dmp.diff_main(hashes1, hashes2, checklines=False)
69 | 
70 |     position = 0
71 |     for (ar,hashes) in diffs:
72 |         content = unhash(hashes,h2t,sep=sep)
73 |         if ar in report:
74 |             yield position, ar, content
75 | 
76 |         if ar != -1: position += len(content)
77 | 
78 | 
79 | def main():
80 | 
81 |     lastRev = {'content':''}
82 |     content = ''
83 |     for rev in revs:
84 |         # Rebuild the revision text by applying the diff operations
85 |         # to the previously reconstructed content.
86 |         buff = StringIO()
87 |         oldPos = 0
88 |         lastPos = 0
89 |         for pos, ar, c in simpleDiff(lastRev['content'], rev['content'], report=[-1,1]):
90 |             equal = content[oldPos:oldPos+pos-lastPos]
91 |             buff.write(equal)
92 |             lastPos += len(equal)
93 |             oldPos += len(equal)
94 | 
95 |             if ar == 1:
96 |                 buff.write(c)
97 |                 lastPos += len(c)
98 |             elif ar == -1:
99 |                 oldPos += len(c)
100 | 
101 |             print("%s, %s, %r" % (pos, ar, c))
102 | 
103 |         buff.write(content[oldPos:])
104 | 
105 |         content = buff.getvalue()
106 |         print("Rev: id=%s\n\t%r\n\t%r" % (rev['rev_id'], rev['content'], content))
107 |         lastRev = rev
108 | 
109 |     content1 = open("content.2.txt", "r").read()
110 |     hashes1, h2t, t2h = hashTokens(tokenize(content1))
111 |     print(len(hashes1))
112 | 
113 |     content = open("content.txt", "r").read()
114 |     hashes2, h2t, t2h = hashTokens(tokenize(content), h2t, t2h)
115 |     print(len(hashes2))
116 | 
117 | 
118 | if __name__ == "__main__": main()
119 | 
--------------------------------------------------------------------------------
/python/page_sample.xml:
--------------------------------------------------------------------------------
1 |
2 | Bassist
3 | 60001
4 |
5 | 108204
6 | 2002-06-30T02:03:23Z
7 |
8 | 195.149.37.198
9 |
10 |
11 | stub
12 | A <b>bassist</b> is somebody who plays a [[bass guitar]] or [[double bass]].
13 |
14 |
15 | 208937
16 | 2002-06-30T16:00:41Z
17 |
18 | JeLuF
19 | 733
20 |
21 | added list
22 | A <b>bassist</b> is somebody who plays a [[bass guitar]] or [[double bass]].
23 |
24 | Famous bassists include:
25 | * [[Ron Carter]]
26 | * [[Les Claypool]] from [[Primus]]
27 | * [[John Entwistle]] from [[The Who]]
28 | * [[Kelly Grouchet]] from [[Electric Light Orchestra]]
29 | * [[Glenn Hughes]] from [[Deep Purple]]
30 | * [[Lemmy Kilmister]] from [[Motorhead]]
31 | * Sir [[Paul McCartney]] from [[The Beatles]]
32 | * [[Charles Mingus]]
33 | * [[Jason Newsted]] from [[Metallica]]
34 | * [[Sting]] from [[The Police]]
35 | * [[Leon Wilkeson]] from [[Lynyrd Skynyrd]]
36 |
37 |
38 |
39 | AccessibleComputing
40 | 10
41 |
42 | 100
43 | 2009-04-12T17:03:02Z
44 |
45 | foo bar
46 |
47 |
48 | 200
49 | 2009-04-12T17:03:02Z
50 |
51 | baz
52 |
53 |
54 |
55 | TestPage or something
56 | 9001
57 |
58 | 100
59 | 2009-04-12T17:03:02Z
60 |
61 | {| style="float: right; clear: right; background-color: transparent"
62 | | {{Infobox Military Conflict
63 | |conflict=Sinai and Palestine Campaign
64 | |partof=[[Middle Eastern theatre of World War I|Middle Eastern theatre]] ([[World War I]])
65 | |image=[[Image:Anzacsoldierandhorseinsinaiandpalestinecampaign.JPG|200px]]
66 | |caption=A model of a typical [[ANZAC]] soldier and his horse during the campaign
67 | |date=28 January 1915 - 28 October 1918
68 | |place=[[Sinai Peninsula]], [[Palestine]], and [[Syria]]
69 | |result=Allied Victory
70 | |territory=[[Partitioning of the Ottoman Empire]]
71 | |combatant1={{flagicon|United Kingdom}} [[British Empire]]<br>
72 | *{{flagicon|United Kingdom}} [[united Kingdom of Great Britain and Ireland|United Kingdom]]
73 | *{{flagicon|Australia}} [[Military history of Australia during World War I|Australia]]
74 | *{{flagicon|New Zealand}} [[Dominion of New Zealand|New Zealand]]
75 | *{{flagicon|India|British}} [[British Raj|India]]
76 | {{flag|France}}<br>{{flagicon|Italy|1861}} [[Kingdom of Italy (1861-1946)|Kingdom of Italy]]
77 | |combatant2={{flag|Ottoman Empire}}<br>{{flag|German Empire}}
78 | |commander1={{flagicon|United Kingdom}} [[John Maxwell (British Army officer)|Sir John Maxwell]]<br>{{flagicon|United Kingdom}} [[Sir Archibald Murray]]<br>{{flagicon|United Kingdom}} [[Philip Chetwode]]<br>{{flagicon|United Kingdom}} [[Charles Dobell]]<br>{{flagicon|United Kingdom}} [[Edmund Allenby]]<br>{{flagicon|Australia}} [[Henry George Chauvel]]<br>{{flagicon|United Kingdom}} [[Edward Bulfin]]
79 | |commander2={{flagicon|Ottoman Empire}} [[Ahmed Djemal|Djemal Pasha]]<br>{{flagicon|Ottoman Empire}} [[Jadir Bey]]<br>{{flagicon|Ottoman Empire}} [[Tala Bey]]<br>{{flagicon|German Empire}} [[Friedrich Freiherr Kress von Kressenstein]]<br>{{flagicon|German Empire}} [[Erich von Falkenhayn]]<br>{{flagicon|German Empire}} [[Otto Liman von Sanders]]
80 | |strength1=
81 | |strength2=
82 | |casualties1=
83 | |casualties2=
84 | |notes=
85 | }}
86 | |-
87 | |{{Campaignbox Sinai and Palestine}}{{WWITheatre}}
88 | |}
89 | The '''Sinai and Palestine Campaign''' during the [[Middle Eastern Theatre of World War I]] was a series of battles which took place in the [[Sinai Peninsula]], [[Ottoman Palestine]], and [[Syria]] between 28 January, 1915 and 28 October, 1918. [[United Kingdom|British]], [[British Indian Army|Indian]], [[Australia]]n, and [[New Zealand]] forces opposed the [[German Empire|German]] and [[Ottoman Empire|Turkish]] forces.
90 |
91 | As a result of several victories in Egypt in the late 19th Century, Britain gained control of that country and established a British protectorate there, soon after the beginning of the First World War. The Ottoman Empire also started to take an interest in Egypt quite early on in the war, possibly at the behest of Germany. The Suez Canal was their prime concern but unrest was also fomented by the Sanussi to the west of Cairo and to the south in Sudan.
92 |
93 | The Commander–in–Chief of the British Protectorate of Egypt, Major–General Sir John Maxwell [had fought in Egypt in the 1882 Battle of Tel el Kebir and in the Sudan in 1885 and 1898] describes his appointment and the situation in Egypt when he arrived –
94 |
95 | 'On August 29th, 1914 I was at the Headquarters of Marshal Joffre, at Vitry le Francois, where I received orders from Field–Marshall Earl Kitchener to proceed at once to Egypt and take over the command there. Somewhat disconcerted, I complied and arrived September 8th in that country.
96 |
97 | When I left France the French and British armies were in full retreat to the line of the Marne. Our little Army, after magnificent and strenuous resistance, had suffered terribly, and the question of reinforcements was paramount. It was, therefore, no surprise when, on my arrival in Egypt, I received orders to send every British soldier at once to England. I was informed that large forces were expected to be passing through the Suez Canal en route to Europe, and that a Territorial Division would be sent as soon as possible. The situation I found was by no means a pleasant one. The Turks were sitting on the fence, the Khedive Abbas was in Constantinople intriguing against us. The population of Egypt was some 12 millions, the great majority Moslems, in sympathy with their co–religionists the Turks; of the European population, the majority was Italian, Greek, German and Austrian, with a good proportion of Turks and Turko–Egyptians, Syrians and Armenians. The British and French were in a decided minority.' <ref>Powles, C. Guy, 'The New Zealanders in Sinai and Palestine' Volume III 'Official History New Zealand's Effort in the Great War' (Auckland, Christchurch, Dunedin and Wellington: Whitcombe & Tombs Ltd, 1922) p. vii</ref>
98 |
99 | ==Defence of Egypt – Eastern Frontier; Defence of Suez Canal==
100 |
101 | The Suez Canal very quickly became of great importance to both sides. To the Ottoman Empire the canal represented the closest and weakest link in British communications, being located in an erstwhile part of the Ottoman Empire. At the beginning of the war Egypt was still linked to the Ottoman Empire by its head of state which subsisted until the British Protectorate was declared.
102 |
103 | To the British the Suez Canal was of vital strategic importance. By removing the need to travel around the Cape of Good Hope, it cut the travelling time from Britain to India, New Zealand and Australia, and was therefore vital to the support of the British war effort in the European sector by the Colonies and Dominions.
104 |
105 | However, at the beginning of the war its defence posed a number of problems. There was no road to the canal, and only one railway track crossed the thirty miles of desert from Cairo to Ismailia, running thence north to Port Said and south to Suez. If Ismailia, near the main gates and sluices, were captured, the loss of the vital Nile fresh water these towns relied on would make their continued habitation very difficult and their strategic importance virtually nil.
106 |
107 | The Sinai was policed by a token defence force which very quickly evacuated the area in November 1914 leaving only very few troops on the eastern side of the Suez Canal. The 30,000 strong defenders were made up of two Indian infantry divisions and one Indian mounted brigade supported by Indian mountain artillery. They were the 10th and 11th Divisions and the Imperial Service Cavalry Brigade and they mounted their main defences on the Cairo side of the canal. The Ottoman Army very quickly advanced across Sinai and by February 1915 had staged attacks against all three towns on the canal with the major effort being in the centre at Ismailia. This force could rely on there being Allied shipping in the canal which could turn its guns to their support and likely some observation balloons.
108 |
109 | ===Ottoman advance towards the Suez Canal===
110 | [[Image:MapSinaiWWI.jpg|thumb|left|<center>Map of north and central [[Sinai]], 1917</center>]]
111 | The [[Ottoman Empire]], at the urging of their German ally, chose to attack British and Egyptian forces in Egypt and shut the [[Suez Canal]] in the [[First Suez Offensive]]. The Ottoman Fourth Army, under the command of the Turkish Minister of Marine, [[Ahmed Djemal|Djemal Pasha]], was based in [[Jerusalem]]. At this time, the Sinai was an almost empty desert and very hard for an army to cross as there were neither roads nor water sources. The chief of staff for the Ottoman Fourth Army was the Bavarian Colonel [[Friedrich Freiherr Kress von Kressenstein|Kress von Kressenstein]], who organized the attack and managed to get supplies for the army as it crossed the desert.
112 |
113 | Under the leadership of Kress von Kressenstein, the Ottoman Army force began to move towards the Canal in mid January 1915 from their 'Principal Desert Base' at Hafir el Auja in three echelons. [See Library of Congress's American Colony (Jerusalem) 1914-1917 Photo album Call Number LOT 13833; Photo Number 41 of 243; Photo Album 13709; external link below need to click 'next group' to group 37 to 48] The northern group moved via Magdhaba to El Arish and thence along the northern route towards Port Said. From Auja, the central group, also the largest, moved via the water cisterns at Moiya Harab and the wells at Wady um Muksheib and Jifjafa towards Ismailia where the main gates and sluices vital for the pumping of Nile fresh water to the three towns on the canal were located. Without this water the towns would be very difficult to maintain and defend. Along with their artillery and supplies, this group brought with them flat bottomed boats in which troops could cross the canal. The third, smaller group moved from Auja via Nekl towards Suez in the south. There were approximately 3,000 in the north and south columns and 6,000 in the central column, but there are no reliable German or Turkish sources for the numbers of enemy troops involved.
114 |
115 | [[File:Map 3 Sinai detail Keogh p.26.jpeg|thumb|Map 3 Sinai detail Keogh p. 26]]
116 |
117 | ===First Suez Offensive===
118 | {{Main|First Suez Offensive}}
119 | The Ottoman Suez Expeditionary Force arrived at the canal on 2 February, 1915. The attack failed to achieve surprise as the British and Egyptians were aware of the Ottoman army's approach. In fighting that lasted for two days the Ottomans were beaten, losing some 2000 men. Allied losses were minimal.{{Citation needed|date=November 2009}}
120 |
121 | ===1915 Actions on the Suez Canal 26 January to 4 February===
122 |
123 | The Defence of the Suez Canal campaign began on 26 January 1915 when subsidiary attacks were made near Kantara in the north and Suez in the south by Kress von Kressenstein's minor columns. The Battles Nomenclature Committee assigned the name 'Actions on the Suez Canal' to these operations which, according to the Committee ended with the rout of the enemy following the Battle of Romani on 12 August 1916. <ref>Battles Nomenclature Committee, Army. 'The Official Names of the Battles and Other Engagements Fought by the Military Forces of the British Empire during the Great War, 1914-1919, and the third Afghan War, 1919: Report of the Battles Nomenclature Committee as Approved by The Army Council Presented to Parliament by Command of His Majesty' (London, 1922), p. 31</ref> The major attack on the center about Ismailia was made by the main force early on the morning of 3 February 1915, when the enemy was successful in crossing the canal. However the attack failed to surprise the Indian defenders who kept the enemy from establishing itself on the Canal at a cost of about 700 casualties and 700 prisoners with the Indian Army losing about 150 men. The enemy quickly retreated to the El Arish, Magdhaba, Aujah area from which position Kress von Kressenstein maintained a virtually continuous series of raids and attacks on the Canal endeavoring to disrupt traffic on the Suez Canal.
124 |
125 | Because the Suez Canal was vital to the Allied war effort, this failed attack caused the British to leave far more soldiers protecting the canal than they had planned on, resulting in a smaller force for the [[Battle of Gallipoli|Gallipoli Campaign]]. The British forced the colonial Egyptian Army and Egyptian Navy to be enlarged to help defend Egypt. However, most Egyptians were poorly-armed and poorly-trained.{{Citation needed|date=November 2009}}
126 |
127 | ===Improvements to Suez Canal Defences===
128 | In November 1915 Lord Kitchener had identified the weakness of basing the defence of Egypt on the Suez Canal and Kress von Kressenstein's raiding parties confirmed it. However it was not until towards the end of 1915 as the Gallipoli campaign was drawing to its conclusion that the War Cabinet in London authorised new positions to be established about 10,000 yards east of the Canal in the desert to make the canal safe from long range guns and to provide additional troops to man them.
129 |
130 | Port Said became Headquarters with Kantara Advanced Headquarters of three sectors of the Canal defences –
131 | No. 1 (Southern) Suez to Kabrit HQ Suez
132 | No. 2 (Central) Kabrit to Ferdan HQ Ismailia
133 | No. 3 (Northern) Ferdan to Port Said
134 |
135 | ===1916 Forward Defence of Suez Canal===
136 | When these new defences were established and troops provided to man them, it was decided that the oasis area which stretched westwards towards the Canal from Bir el Abd to Romani and Katia along the ancient silk road needed to be denied to the enemy. Kress von Kressenstein and his forces had made use of this area of reliable drinking water during the previous fighting.
137 |
138 | In order to carry out this plan it was necessary to build a pipeline for the fresh Nile water to be pumped to the troops as they moved out eastwards. A railway was also required to provide supplies and move troops quickly and the laying of rails and sleepers by Egyptian Labour Force soon moved out past the new canal defences making it necessary to send out a brigade to protect the workers and the infrastructure they were building.
139 |
140 | ===Operations to destroy the water on the central road across Sinai===
141 | As long as the water cistern and wells on the central road remained intact, the enemy could move across the Sinai Peninsula to threaten the Canal at any time. The decision was taken in March 1916 for these water sources to be destroyed and the 8th Light Horse Regiment and Birkani Camel Corps were sent to Wady um Muksheib and Moya Harab on 21 March while the 9th Light Horse Regiment, camels and supporting engineers, etc. destroyed the water wells and their pumping equipment on 11 April at Jifjafa.
142 |
143 | ===Affair of Katia===
144 | {{Main|Affair of Katia}}
145 |
146 | This attack by the Ottoman Army on St George's Day 23 April 1916, was possibly a response to the increased presence of the Allies, some distance eastward from the Suez Canal. The 5th Mounted Yeomanry Brigade was spread out at Katia, Bir el Mageibra, Bir el Hamisah and Oghratina where they were surprised and overwhelmed by the enemy. <ref>Wavell, pp. 43–5</ref> All these places are in the vicinity of Romani and played a part in that Battle.
147 |
148 | ===Battle of Romani===
149 | {{Main|Battle of Romani}}
150 | More than a year passed with the British troops content to guard the Suez Canal, and the Ottomans busy fighting the Russians in the Caucasus and the British at Gallipoli and in Mesopotamia. Then in July 1916, the Ottoman army tried another offensive against the Suez Canal. Again, the Ottomans advanced with an over-sized division. Again they ran into a well prepared Allied force, this time at Romani. Again, they retreated after two days of fighting from 3 August to 5 August, 1916.
151 |
152 | Following this victory, the Allied forces sought to prevent the Turkish Canal Expeditionary Force threatening the Suez Canal, by removing them from Bir el Abd. On 9 August, 1916, an indecisive action was fought at Bir el Abd, leading to the Turkish withdrawal to El Arish while leaving a rear guard force at Bir el Mazar.
153 |
154 | ==British advance across the Sinai==
155 | This attack convinced the British to push their defence of the Canal further out, into the Sinai, and so starting in October, the British under Lieutenant General Sir [[Charles Dobell]] began operations into the Sinai desert and on to the border of Palestine. Initial efforts were limited to building a railway and a waterline across the Sinai. After several months building up supplies and troops, the British were ready for an attack. The first battle was the capture of a fortified position at [[Battle of Magdhaba|Magdhaba]] on 23 December, 1916.
156 |
157 | On 8 January, 1917, the [[Anzac Mounted Division]] attacked the fortified town of [[Battle of Rafa|Rafa]]. The attack was successful and the majority of the Turkish garrison was captured. The British had accomplished their objective of protecting the Suez Canal from Turkish attacks, but the new government of [[David Lloyd George]] wanted more.
158 |
159 | ==Palestine campaign==<!-- This section is linked from [[Edmund Allenby, 1st Viscount Allenby]] -->
160 | {{Unreferenced section|date=November 2009}}
161 | [[File:Turkish trenches at Dead Sea2.jpg|right|thumb|Turkish trenches at the shores of the [[Dead Sea]], 1917.]]
162 | The British army in Egypt was ordered to go on the offensive against the [[Ottoman Turks]] in Palestine. In part this was to support the [[Arab revolt]] which had started early in 1916, but also to accomplish something positive after the years of fruitless battles on the [[Western Front (World War I)|Western Front]]. The British commander in Egypt, Sir [[Archibald Murray]], suggested that he needed more troops and ships, but this request was refused.
163 |
164 | [[Image:Sinai-WW1-1.jpg|thumb|300px|left|Assault on [[Gaza]], 1917]]
165 | The Ottoman forces were holding a rough line from the fort at [[Gaza]], on the shore of the [[Mediterranean Sea]], to the town of [[Beersheba]], which was the terminus of the Ottoman railway that extended north to Damascus. The British commander in the field, Dobell, chose to attack Gaza, using a short hook move on 26 March, 1917.
166 |
167 | ===First Battle of Gaza===
168 | {{Main|First Battle of Gaza}}
169 | The British attack was essentially a failure. Due to miscommunication, some units retreated when they should have held onto their gains and so the fortress was not taken.
170 |
171 | The government in London believed the reports from the field which indicated a substantial victory had been won and ordered General Murray to move on and capture [[Jerusalem]]. The British were in no position to attack Jerusalem as they had yet to break through the Ottoman defensive positions. These positions were rapidly improved and credit for the Turkish defence is given to the German chief-of-staff [[Friedrich Freiherr Kress von Kressenstein|Baron Kress von Kressenstein]].
172 |
173 | ===Second Battle of Gaza===
174 | {{Main|Second Battle of Gaza}}
175 | A second attack on the fort of Gaza was launched one month later on 17 April, 1917. This attack, supported by naval gunfire, chlorine gas and even a few early [[Mark I (tank)|tanks]], was also a failure. It was essentially a frontal assault on a fortified position, and its failure was due more to inflexibility in operations than to faults in planning; yet it cost some 6,000 British casualties. As a result both General Dobell and General Murray were removed from command. The new man put in charge was General Sir [[Edmund Allenby]] and his orders were clear: take [[Jerusalem]] by Christmas.
176 |
177 | After personally reviewing the Ottoman defensive positions, Allenby requested reinforcements: three more infantry divisions, aircraft, and artillery. This request was granted and by October, 1917, the British were ready for their next attack.
178 |
179 | The Ottoman army had three active fronts at this time: [[Mesopotamian Campaign|Mesopotamia]], Arabia, and the Gaza front. They also had substantial forces deployed around [[Constantinople]] and in the (now quiet) Caucasus front. Given all these demands, the army in Gaza was only about 35,000 strong, led by the Ottoman General [[Kustafa]] and concentrated in three main defensive locations: Gaza, Tel Es Sheria, and Beersheba. Allenby's army was now much larger, with some 88,000 troops in good condition and well-equipped.
180 |
181 | ===Battle of El Buggar Ridge===
182 | {{Main|Battle of El Buggar Ridge}}
183 | The occupation of Karm by the Allies on 22 October, 1917 created a major point for supply and water for the troops in the immediate area. For the Ottoman forces, the establishment of a railway station at Karm placed under threat the defensive positions known as the Hureira Redoubt and Rushdie System, which formed a powerful bulwark against any Allied action.
184 |
185 | To forestall this threat, General Erich von Falkenhayn, the Commander of the Yildirim Group, proposed a two phase attack. The plan called for a reconnaissance in force from Beersheba on 27 October, to be followed by an all out attack launched by the 8th Army from Hureira. This second phase was ironically scheduled to occur on the morning of 31 October, 1917, the day when the Battle of Beersheba began.
186 |
187 | ===Battle of Beersheba===
188 | {{Main|Battle of Beersheba (1917)}}
189 | A key feature of the British plan was to convince the Turks (and their German leaders) that once again, Gaza was to be attacked. This deception campaign was extremely thorough and convincing. The [[Battle of El Buggar Ridge]], initiated by the Turks, completed the deception. When the Allies launched their attack on Beersheba, the Turks were taken by surprise. In one of the most remarkable feats of planning and execution, the Allies were able to move some 40,000 men and a similar number of horses over hostile and inhospitable terrain without being detected by the Turks. The climax of the battle was one of the last successful cavalry charges of modern warfare, when two Australian Light Horse regiments (4th and 12th) charged across open ground just before dusk and captured the town.
190 |
191 | The Turkish defeat at Beersheba on 31 October was not a complete rout. The Turks retreated into the hills and prepared defensive positions to the north of Beersheba. For the Allies, the following days were spent fighting a difficult and bloody battle at Tel el Khuweilifeh, to the north east of Beersheba.
192 |
193 | [[Image:Palestine-WW1-2.jpg|thumb|300px|right|Allenby's Offensive, November-December 1917]]
194 | To break through the Turkish defensive line, the Allied forces attacked the Ottoman positions at Tel Es Sheria on 6 November, and followed this up with a further attack at Huj the following day. With the imminent collapse of Gaza at the same time, the Turks quickly retreated to a new line of defence.
195 |
196 | ===Third Battle of Gaza===
197 | {{Main|Third Battle of Gaza}}
198 | On 7 November, the British attacked Gaza for the third time. The Turks, worried about being cut off, retreated in the face of the British assault. Gaza had finally been captured.
199 |
200 | The Turkish defensive position was shattered, the Ottoman army was retreating in some disarray, and General Allenby ordered his army to pursue the enemy. The British followed closely on the heels of the retreating Ottoman forces. An attempt by the Turks to form a defence of a place called Junction Station (Wadi Sarar) was foiled by a British attack on 13 November. General Falkenhayn next tried to form a new defensive line from [[Bethlehem]] to Jerusalem to [[Jaffa]]. The first British attack on Jerusalem failed but with a short rest and the gathering of more infantry divisions, Allenby tried again and on 9 December, 1917, Jerusalem was captured. This was a major political event for the British government of David Lloyd George, one of the few real successes the British could point to after three long bloody years of war.
201 |
202 | On the Turkish side, this defeat marked the exit of Djemal Pasha, who returned to [[Istanbul]]. Djemal had delegated the actual command of his army to German officers such as von Kressenstein and von Falkenhayn more than a year earlier, but now, defeated as [[Enver Pasha]] had been at the [[Battle of Sarikamis]], he gave up even nominal command and returned to the capital. Less than a year remained before he was forced out of the government. General Falkenhayn was also replaced, in March 1918.
203 |
204 | == The final year: Palestine and Syria ==
205 | [[Image:Palestine-WW1-3.jpg|thumb|230px|left|Allenby's Final Attack, September 1918]]
206 | The British government had hopes that the Ottoman Empire could be defeated early in the coming year with successful campaigns in Palestine and Mesopotamia but the [[Spring Offensive]] by the Germans on the Western Front delayed the expected attack on Syria for nine full months. General Allenby's army was largely redeployed to France and most of his divisions were rebuilt with units recently recruited in India. His forces spent much of the summer of 1918 training and reorganising.
207 |
208 | Because the British achieved complete control of the air with their new [[Sopwith Camel|fighter planes]], the Turks, and their new German commander, General [[Otto Liman von Sanders|Liman von Sanders]], had no clear idea where the British were going to attack. Compounding the problems, the Turks, at the direction of their [[War Minister]] [[Enver Pasha]] withdrew their best troops during the summer for the creation of Enver's [[Ottoman Army of Islam|Army of Islam]], leaving behind poor quality, dispirited soldiers. During this time, the Turks were distracted by raids against their open desert (eastern) flank by forces of the Arab Revolt commanded by the [[Faisal I of Iraq|Emir Feisal]] and coordinated by [[T. E. Lawrence]] and other British liaison officers, which tied down thousands of soldiers in garrisons throughout Palestine, [[Jordan]], and Syria.
209 |
210 | ===Battle of Megiddo===
211 | {{Main|Battle of Megiddo (1918)}}
212 | General Allenby finally launched his long-delayed attack on 19 September, 1918. The campaign has been called the Battle of Megiddo (which is a transliteration of the Hebrew name of an ancient town known in the west as [[Armageddon]]). Again, the British made major efforts to deceive the Turks as to their actual intended target of operations. This effort was, again, successful and the Turks were taken by surprise when the British attacked Megiddo in a sudden storm. The Turkish troops started a full scale retreat, the British bombed the fleeing columns of men from the air and within a week, the Turkish army in Palestine had ceased to exist as a military force.
213 |
214 | The ultimate goal of Allenby's and Feisal's armies was [[Damascus]]. Two separate Allied columns marched towards Damascus. The first, composed mainly of Australian and Indian cavalry, approached from Galilee, while the other column, consisting of Indian cavalry and the ''ad hoc'' militia following T.E. Lawrence, travelled northwards along the [[Hejaz Railway]]. Australian Light Horse troops marched unopposed into Damascus on 1 October, 1918, despite the presence of some 12,000 Turkish soldiers at Baramke Barracks. Major Olden of the Australian 10th Light Horse Regiment received the Official Surrender of the City at 7 am at the Serai. Later that day, Lawrence's irregulars entered Damascus to claim full credit for its capture.
215 |
216 | The war in Palestine was over but in Syria lasted for a further month. The Turkish government was quite prepared to sacrifice these non-Turkish provinces without surrendering. Indeed, while this battle was raging, the Turks sent an expeditionary force into Russia to enlarge the ethnic Turkish elements of the empire. It was only after the surrender of Bulgaria, which put Turkey into a vulnerable position for invasion, that the Turkish government was compelled to sign an armistice on 30 October, 1918, and surrendered outright two days later. Six hundred years of Ottoman rule over the [[Middle East]] had come to an end.
217 |
218 | == In popular media ==
219 | This campaign has been depicted in several films. The most famous is ''[[Lawrence of Arabia (film)|Lawrence of Arabia]]'' (1962), though it focused primarily on T.E. Lawrence and the Arab Revolt. Other films dealing with this topic include ''[[Forty Thousand Horsemen]]'' (1941), and ''[[The Lighthorsemen (film)|The Lighthorsemen]]'' (1987), with [[Peter Phelps]] and [[Nick Waters]], both of which focused on the role of the ANZAC forces during the campaign.
220 |
221 | ==Summary==
222 | The British suffered a total of 550,000 casualties: more than 90% of these were not battle losses but instead attributable to disease, heat and other secondary causes. Total Turkish losses are unknown but almost certainly larger: an entire army was lost in the fighting and the Turks poured a vast number of troops into the front over the three years of combat.
223 |
224 | Despite the uncertainty of casualty counts, the historical consequences of this campaign are hard to overestimate. The British conquest of Palestine led directly to the [[British mandate]] over Palestine and the [[Trans-Jordan]] which, in turn, paved the way for the creation of the states of [[Israel]], [[Jordan]], [[Lebanon]], and [[Syria]].
225 |
226 | ==References==
227 | {{Reflist}}
228 |
229 | ==See also==
230 | {{portal|World War I}}
231 | *[[Bund der Asienkämpfer]]
232 | {{Commonscat-inline|Sinai and Palestine Campaign}}
233 |
234 | ==External links==
235 | * First World War.com. [http://www.firstworldwar.com/battles/suez.htm Defence of the Suez Canal, 1915]. Retrieved 19 December, 2005.
236 | * [http://alh-research.tripod.com/Light_Horse/ Australian Light Horse Studies Centre]
237 | * [http://www.turkeyswar.com/campaigns/palestine1.htm Palestine pages of 'Turkey in WW1' web site]
238 | * [http://www.nzhistory.net.nz/node/13507 Sinai campaign (NZHistory.net.nz)]
239 | * [http://www.nzhistory.net.nz/node/14256 Palestine campaign (NZHistory.net.nz)]
240 | * [http://www.ottomanpalestine.com/GALLERY_1.htm The Photographs of Palestine Campaign]
241 | * [http://hdl.loc.gov/loc.pnp/ppmsca.13709 Library of Congress's American Colony in Jerusalem's Photo Album]
242 |
243 | ==Sources==
244 | * Battles Nomenclature Committee, Army. ''The Official Names of the Battles and Other Engagements Fought by the Military Forces of the British Empire during the Great War, 1914-1919, and the third Afghan War, 1919: Report of the Battles Nomenclature Committee as Approved by The Army Council Presented to Parliament by Command of His Majesty'' (London, 1922).
245 | * Jean Bou, ''A History of Australia's Mounted Arm'' Series: Australian Army History Series (Port Melbourne: Cambridge University Press, 2009).
246 | * Bruce, Anthony (2002). ''The Last Crusade: The Palestinian Campaign in the First World War''. John Murray.
247 | * Field Marshal Lord Carver, ''The National Army Museum Book of The Turkish Front 1914-1918 The Campaigns at Gallipoli, in Mesopotamia and in Palestine'' (London: Pan Macmillan, 2003).
248 | * R. M. Downes, ''The Campaign in Sinai and Palestine'' Part II in Volume 1 ''Gallipoli, Palestine and New Guinea'' of A. G. Butler, ''Official History of the Australian Army Medical Services, 1914–1918'' (2nd edition 1938) p. 553. On line at Australian War Memorial; Official Histories.
249 | * Erickson, Edward J., ''Ordered to Die A History of the Ottoman Army in the First World War'' Foreword by General Hüseyin Kivrikoglu Contributions in Military Studies, No. 201 (Westport Connecticut: Greenwood Press, 2001).
250 | * Esposito, Vincent (ed.) (1959). ''The West Point Atlas of American Wars - Vol. 2''. Frederick Praeger Press.
251 | * Fromkin, David (1989). ''A Peace to End All Peace''. Avon Books.
252 | * Grainger, John D. (2006) ''The Battle for Palestine: 1917'' Boydell Press. ISBN 1 84383 263 1
253 | * Keegan, John (1998). ''The First World War''. Random House Press.
254 | * E.G. Keogh, ''Suez to Aleppo'' (Melbourne: Directorate of Military Training, 1955).
255 | * Preston, Lieutenant-Colonel Richard Martin (1921) ''The Desert Mounted Corps: An Account of the Cavalry Operations in Palestine and Syria 1914 to 1918''. Houghton Mifflin Company. [http://books.google.com/books?id=LHg5xNCFDGsC Google Books Search]
256 | * Powles, C. Guy, ''The New Zealanders in Sinai and Palestine'' Volume III ''Official History New Zealand's Effort in the Great War'' (Auckland, Christchurch, Dunedin and Wellington: Whitcombe & Tombs Ltd, 1922).
257 | * War Diaries of 1st, 2nd and 3rd Light Horse Brigades. [available on the Australian War Memorial's web site]
258 | * Field Marshal Earl Wavell, ''The Palestine Campaigns'' 3rd Edition thirteenth Printing; Series: A Short History of the British Army 4th Edition by Major E.W. Sheppard (London: Constable & Co. 1968).
259 | * Woodward, David R (2006). ''Forgotten Soldiers of the First World War - Lost Voices from the Middle Eastern Front''. Tempus Publishing.
260 |
261 | {{World War I}}
262 |
263 | [[Category:Ottoman Empire and World War I]]
264 | [[Category:Middle Eastern theatre of World War I| ]]
265 | [[Category:Campaigns and theatres of World War I|Sinai and Palestine]]
266 | [[Category:Military campaigns and theatres of World War I involving Australia]]
267 |
268 | [[es:Campaña del Sinaí y Palestina]]
269 | [[he:המערכה על סיני וארץ ישראל במלחמת העולם הראשונה]]
270 | [[hu:Palesztin front (első világháború)]]
271 | [[pt:Campanha do Sinai e Palestina]]
272 | [[ru:Синайско-Палестинская кампания]]
273 | [[sr:Синајски и палестински поход]]
274 | [[tr:Sina ve Filistin Cephesi]]
275 |
276 |
277 | 200
278 | 2009-04-12T17:03:02Z
279 |
280 | {| style="float: right; clear: right; background-color: transparent"
281 | | {{Infobox Military Conflict
282 | |conflict=Sinai and Palestine Campaign
283 | |partof=[[Middle Eastern theatre of World War I|Middle Eastern theatre]] ([[World War I]])
284 | |image=[[Image:Anzacsoldierandhorseinsinaiandpalestinecampaign.JPG|200px]]
285 | |caption=A model of a typical [[ANZAC]] soldier and his horse during the campaign
286 | |date=28 January 1915 - 28 October 1918
287 | |place=[[Sinai Peninsula]], [[Palestine]], and [[Syria]]
288 | |result=Allied Victory
289 | |territory=[[Partitioning of the Ottoman Empire]]
290 | |combatant1={{flagicon|United Kingdom}} [[British Empire]]<br>
291 | *{{flagicon|United Kingdom}} [[united Kingdom of Great Britain and Ireland|United Kingdom]]
292 | *{{flagicon|Australia}} [[Military history of Australia during World War I|Australia]]
293 | *{{flagicon|New Zealand}} [[Dominion of New Zealand|New Zealand]]
294 | *{{flagicon|India|British}} [[British Raj|India]]
295 | {{flag|France}}<br>{{flagicon|Italy|1861}} [[Kingdom of Italy (1861-1946)|Kingdom of Italy]]
296 | |combatant2={{flag|Ottoman Empire}}<br>{{flag|German Empire}}
297 | |commander1={{flagicon|United Kingdom}} [[John Maxwell (British Army officer)|Sir John Maxwell]]<br>{{flagicon|United Kingdom}} [[Sir Archibald Murray]]<br>{{flagicon|United Kingdom}} [[Philip Chetwode]]<br>{{flagicon|United Kingdom}} [[Charles Dobell]]<br>{{flagicon|United Kingdom}} [[Edmund Allenby]]<br>{{flagicon|Australia}} [[Henry George Chauvel]]<br>{{flagicon|United Kingdom}} [[Edward Bulfin]]
298 | |commander2={{flagicon|Ottoman Empire}} [[Ahmed Djemal|Djemal Pasha]]<br>{{flagicon|Ottoman Empire}} [[Jadir Bey]]<br>{{flagicon|Ottoman Empire}} [[Tala Bey]]<br>{{flagicon|German Empire}} [[Friedrich Freiherr Kress von Kressenstein]]<br>{{flagicon|German Empire}} [[Erich von Falkenhayn]]<br>{{flagicon|German Empire}} [[Otto Liman von Sanders]]
299 | |strength1=
300 | |strength2=
301 | |casualties1=
302 | |casualties2=
303 | |notes=
304 | }}
305 | |-
306 | |{{Campaignbox Sinai and Palestine}}{{WWITheatre}}
307 | |}
308 | The '''Sinai and Palestine Campaign''' during the [[Middle Eastern Theatre of World War I]] was a series of battles which took place in the [[Sinai Peninsula]], [[Ottoman Palestine]], and [[Syria]] between 28 January, 1915 and 28 October, 1918. [[United Kingdom|British]], [[British Indian Army|Indian]], [[Australia]]n, and [[New Zealand]] forces opposed the [[German Empire|German]] and [[Ottoman Empire|Turkish]] forces.
309 |
310 | As a result of several victories in Egypt in the late 19th Century, Britain gained control of that country and established a British protectorate there, soon after the beginning of the First World War. The Ottoman Empire also started to take an interest in Egypt quite early on in the war, possibly at the behest of Germany. The Suez Canal was their prime concern but unrest was also fomented by the Sanussi to the west of Cairo and to the south in Sudan.
311 |
312 | The Commander–in–Chief of the British Protectorate of Egypt, Major–General Sir John Maxwell [had fought in Egypt in the 1882 Battle of Tel el Kebir and in the Sudan in 1885 and 1898] describes his appointment and the situation in Egypt when he arrived –
313 |
314 | 'On August 29th, 1914 I was at the Headquarters of Marshal Joffre, at Vitry le Francois, where I received orders from Field–Marshall Earl Kitchener to proceed at once to Egypt and take over the command there. Somewhat disconcerted, I complied and arrived September 8th in that country.
315 |
316 | When I left France the French and British armies were in full retreat to the line of the Marne. Our little Army, after magnificent and strenuous resistance, had suffered terribly, and the question of reinforcements was paramount. It was, therefore, no surprise when, on my arrival in Egypt, I received orders to send every British soldier at once to England. I was informed that large forces were expected to be passing through the Suez Canal en route to Europe, and that a Territorial Division would be sent as soon as possible. The situation I found was by no means a pleasant one. The Turks were sitting on the fence, the Khedive Abbas was in Constantinople intriguing against us. The population of Egypt was some 12 millions, the great majority Moslems, in sympathy with their co–religionists the Turks; of the European population, the majority was Italian, Greek, German and Austrian, with a good proportion of Turks and Turko–Egyptians, Syrians and Armenians. The British and French were in a decided minority.' <ref>Powles, C. Guy, 'The New Zealanders in Sinai and Palestine' Volume III 'Official History New Zealand's Effort in the Great War' (Auckland, Christchurch, Dunedin and Wellington: Whitcombe & Tombs Ltd, 1922) p. vii</ref>
317 |
318 | ==Defence of Egypt – Eastern Frontier; Defence of Suez Canal==
319 |
320 | The Suez Canal very quickly became of great importance to both sides. To the Ottoman Empire the canal represented the closest and weakest link in British communications, being located in an erstwhile part of the Ottoman Empire. At the beginning of the war Egypt was still linked to the Ottoman Empire by its head of state which subsisted until the British Protectorate was declared.
321 |
322 | To the British the Suez Canal was of vital strategic importance. By sparing ships the voyage around the Cape of Good Hope, the Suez Canal cut the traveling time from Britain to India, New Zealand and Australia and was therefore vital to the support of the British war effort in the European sector by the Colonies and Dominions.
323 |
324 | However, at the beginning of the war its defence posed a number of problems. There was no road to the canal; only one railway track crossed the thirty miles of desert from Cairo to Ismailia, continuing thence north to Port Said and south to Suez. If Ismailia, near the main gates and sluices, were captured, the loss of the vital Nile fresh water these towns relied on would make their continued habitation very difficult and their strategic importance virtually nil.
325 |
326 | The Sinai was policed by a token defence force which very quickly evacuated the area in November 1914 leaving only very few troops on the eastern side of the Suez Canal. The 30,000 strong defenders were made up of two Indian infantry divisions and one Indian mounted brigade supported by Indian mountain artillery. They were the 10th and 11th Divisions and the Imperial Service Cavalry Brigade and they mounted their main defences on the Cairo side of the canal. The Ottoman Army very quickly advanced across Sinai and by February 1915 had staged attacks against all three towns on the canal with the major effort being in the centre at Ismailia. The defenders could rely on there being Allied shipping in the canal which could turn its guns to their support, and likely on some observation balloons.
327 |
328 | ===Ottoman advance towards the Suez Canal===
329 | [[Image:MapSinaiWWI.jpg|thumb|left|<center>Map of north and central [[Sinai]], 1917</center>]]
330 | The [[Ottoman Empire]], at the urging of their German ally, chose to attack British and Egyptian forces in Egypt and shut the [[Suez Canal]] in the [[First Suez Offensive]]. The Ottoman Fourth Army, under the command of the Turkish Minister of Marine, [[Ahmed Djemal|Djemal Pasha]], was based in [[Jerusalem]]. At this time, the Sinai was an almost empty desert and very hard for an army to cross as there were neither roads nor water sources. The chief of staff for the Ottoman Fourth Army was the Bavarian Colonel [[Friedrich Freiherr Kress von Kressenstein|Kress von Kressenstein]], who organized the attack and managed to get supplies for the army as it crossed the desert.
331 |
332 | Under the leadership of Kress von Kressenstein, the Ottoman Army force began to move towards the Canal in mid January 1915 from their 'Principal Desert Base' at Hafir el Auja in three echelons. [See Library of Congress's American Colony (Jerusalem) 1914-1917 Photo album Call Number LOT 13833; Photo Number 41 of 243; Photo Album 13709; external link below need to click 'next group' to group 37 to 48] The northern group moved via Magdhaba to El Arish and thence along the northern route towards Port Said. From Auja, the central group, also the largest, moved via the water cisterns at Moiya Harab and the wells at Wady um Muksheib and Jifjafa towards Ismailia where the main gates and sluices vital for the pumping of Nile fresh water to the three towns on the canal were located. Without this water the towns would be very difficult to maintain and defend. Along with their artillery and supplies, this group brought with them flat bottomed boats in which troops could cross the canal. The third, smaller group moved from Auja via Nekl towards Suez in the south. There were approximately 3,000 in the north and south columns and 6,000 in the central column, but there are no reliable German or Turkish sources for the numbers of enemy troops involved.
333 |
334 | [[File:Map 3 Sinai detail Keogh p.26.jpeg|thumb|Map 3 Sinai detail Keogh p. 26]]
335 |
336 | ===First Suez Offensive===
337 | {{Main|First Suez Offensive}}
338 | The Ottoman Suez Expeditionary Force arrived at the canal on 2 February, 1915. The attack failed to achieve surprise as the British and Egyptians were aware of the Ottoman army's approach. In fighting that lasted for two days the Ottomans were beaten, losing some 2000 men. Allied losses were minimal.{{Citation needed|date=November 2009}}
339 |
340 | ===1915 Actions on the Suez Canal 26 January to 4 February===
341 |
342 | The Defence of the Suez Canal campaign began on 26 January 1915 when subsidiary attacks were made near Kantara in the north and Suez in the south by Kress von Kressenstein's minor columns. The Battles Nomenclature Committee assigned the name 'Actions on the Suez Canal' to these operations which, according to the Committee, ended with the rout of the enemy following the Battle of Romani on 12 August 1916. <ref>Battles Nomenclature Committee, Army. 'The Official Names of the Battles and Other Engagements Fought by the Military Forces of the British Empire during the Great War, 1914-1919, and the third Afghan War, 1919: Report of the Battles Nomenclature Committee as Approved by The Army Council Presented to Parliament by Command of His Majesty' (London, 1922), p. 31</ref> The major attack on the center about Ismailia was made by the main force early on the morning of 3 February 1915, when the enemy was successful in crossing the canal. However the attack failed to surprise the Indian defenders who kept the enemy from establishing itself on the Canal at a cost of about 700 casualties and 700 prisoners with the Indian Army losing about 150 men. The enemy quickly retreated to the El Arish, Magdhaba, Aujah area from which position Kress von Kressenstein maintained a virtually continuous series of raids and attacks on the Canal endeavoring to disrupt traffic on the Suez Canal.
343 |
344 | Because the Suez Canal was vital to the Allied war effort, this failed attack caused the British to leave far more soldiers protecting the canal than they had planned on, resulting in a smaller force for the [[Battle of Gallipoli|Gallipoli Campaign]]. The British forced the colonial Egyptian Army and Egyptian Navy to be enlarged to help defend Egypt. However, most Egyptians were poorly-armed and poorly-trained.{{Citation needed|date=November 2009}}
345 |
346 | ===Improvements to Suez Canal Defences===
347 | In November 1915 Lord Kitchener had identified the weakness of basing the defence of Egypt on the Suez Canal and Kress von Kressenstein's raiding parties confirmed it. However, it was not until towards the end of 1915, as the Gallipoli campaign was drawing to its conclusion, that the War Cabinet in London authorised new positions to be established about 10,000 yards east of the Canal in the desert, to make the canal safe from long range guns, and additional troops to be provided to man them.
348 |
349 | Port Said became Headquarters with Kantara Advanced Headquarters of three sectors of the Canal defences –
350 | No. 1 (Southern) Suez to Kabrit HQ Suez
351 | No. 2 (Central) Kabrit to Ferdan HQ Ismailia
352 | No. 3 (Northern) Ferdan to Port Said
353 |
354 | ===1916 Forward Defence of Suez Canal===
355 | When these new defences were established and troops provided to man them, it was decided that the oasis area which stretched westwards towards the Canal from Bir el Abd to Romani and Katia along the ancient silk road needed to be denied to the enemy. Kress von Kressenstein and his forces had made use of this area of reliable drinking water during the previous fighting.
356 |
357 | In order to carry out this plan it was necessary to build a pipeline for the fresh Nile water to be pumped to the troops as they moved out eastwards. A railway was also required to provide supplies and move troops quickly and the laying of rails and sleepers by Egyptian Labour Force soon moved out past the new canal defences making it necessary to send out a brigade to protect the workers and the infrastructure they were building.
358 |
359 | ===Operations to destroy the water on the central road across Sinai===
360 | As long as the water cistern and wells on the central road remained intact, the enemy could move across the Sinai Peninsula to threaten the Canal at any time. The decision was taken in March 1916 for these water sources to be destroyed and the 8th Light Horse Regiment and Birkani Camel Corps were sent to Wady um Muksheib and Moya Harab on 21 March while the 9th Light Horse Regiment, camels and supporting engineers, and, according to the 3rd Light Horse Brigade's War Diary, 30 Light Horsemen armed as Lancers, destroyed the water wells and their pumping equipment on 11 April at Jifjafa.
361 |
362 | ===Affair of Katia===
363 | {{Main|Affair of Katia}}
364 |
365 | This attack by the Ottoman Army on St George's Day 23 April 1916, was possibly a response to the increased presence of the Allies, some distance eastward from the Suez Canal. The 5th Mounted Yeomanry Brigade was spread out at Katia, Bir el Mageibra, Bir el Hamisah and Oghratina where they were surprised and overwhelmed by the enemy. <ref>Wavell, pp. 43–5</ref> All these places are in the vicinity of Romani and played a part in that Battle.
366 |
367 | ===Battle of Romani===
368 | {{Main|Battle of Romani}}
369 | More than a year passed with the British troops content to guard the Suez Canal, and the Ottomans busy fighting the Russians in the Caucasus and the British at Gallipoli and in Mesopotamia. Then in July 1916, the Ottoman army tried another offensive against the Suez Canal. Again, the Ottomans advanced with an over-sized division. Again they ran into a well prepared Allied force, this time at Romani. Again, they retreated after two days of fighting from 3 August to 5 August, 1916.
370 |
371 | Following this victory, the Allied forces sought to prevent the Turkish Canal Expeditionary Force threatening the Suez Canal, by removing them from Bir el Abd. On 9 August, 1916, an indecisive action was fought at Bir el Abd, leading to the Turkish withdrawal to El Arish while leaving a rear guard force at Bir el Mazar.
372 |
373 | ==British advance across the Sinai==
374 | This attack convinced the British to push their defence of the Canal further out, into the Sinai, and so starting in October, the British under Lieutenant General Sir [[Charles Dobell]] began operations into the Sinai desert and on to the border of Palestine. Initial efforts were limited to building a railway and a waterline across the Sinai. After several months building up supplies and troops, the British were ready for an attack. The first battle was the capture of a fortified position at [[Battle of Magdhaba|Magdhaba]] on 23 December, 1916.
375 |
376 | On 8 January, 1917, the [[Anzac Mounted Division]] attacked the fortified town of [[Battle of Rafa|Rafa]]. The attack was successful and the majority of the Turkish garrison was captured. The British had accomplished their objective of protecting the Suez Canal from Turkish attacks, but the new government of [[David Lloyd George]] wanted more.
377 |
378 | ==Palestine campaign==<!-- This section is linked from [[Edmund Allenby, 1st Viscount Allenby]] -->
379 | {{Unreferenced section|date=November 2009}}
380 | [[File:Turkish trenches at Dead Sea2.jpg|right|thumb|Turkish trenches at the shores of the [[Dead Sea]], 1917.]]
381 | The British army in Egypt was ordered to go on the offensive against the [[Ottoman Turks]] in Palestine. In part this was to support the [[Arab revolt]] which had started early in 1916, but also to accomplish something positive after the years of fruitless battles on the [[Western Front (World War I)|Western Front]]. The British commander in Egypt, Sir [[Archibald Murray]], suggested that he needed more troops and ships, but this request was refused.
382 |
383 | [[Image:Sinai-WW1-1.jpg|thumb|300px|left|Assault on [[Gaza]], 1917]]
384 | The Ottoman forces were holding a rough line from the fort at [[Gaza]], on the shore of the [[Mediterranean Sea]], to the town of [[Beersheba]], which was the terminus of the Ottoman railway that extended north to Damascus. The British commander in the field, Dobell, chose to attack Gaza, using a short hook move on 26 March, 1917.
385 |
386 | ===First Battle of Gaza===
387 | {{Main|First Battle of Gaza}}
388 | The British attack was essentially a failure. Due to miscommunication, some units retreated when they should have held onto their gains and so the fortress was not taken.
389 |
390 | The government in London believed the reports from the field which indicated a substantial victory had been won and ordered General Murray to move on and capture [[Jerusalem]]. The British were in no position to attack Jerusalem as they had yet to break through the Ottoman defensive positions. These positions were rapidly improved and credit for the Turkish defence is given to the German chief-of-staff [[Friedrich Freiherr Kress von Kressenstein|Baron Kress von Kressenstein]].
391 |
392 | ===Second Battle of Gaza===
393 | {{Main|Second Battle of Gaza}}
394 | A second attack on the fort of Gaza was launched one month later on 17 April, 1917. This attack, supported by naval gunfire, chlorine gas and even a few early [[Mark I (tank)|tanks]], was also a failure. It was essentially a frontal assault on a fortified position, and its failure was due more to inflexibility in operations than to faults in planning; yet it cost some 6,000 British casualties. As a result both General Dobell and General Murray were removed from command. The new man put in charge was General Sir [[Edmund Allenby]] and his orders were clear: take [[Jerusalem]] by Christmas.
395 |
396 | After personally reviewing the Ottoman defensive positions, Allenby requested reinforcements: three more infantry divisions, aircraft, and artillery. This request was granted and by October, 1917, the British were ready for their next attack.
397 |
398 | The Ottoman army had three active fronts at this time: [[Mesopotamian Campaign|Mesopotamia]], Arabia, and the Gaza front. They also had substantial forces deployed around [[Constantinople]] and in the (now quiet) Caucasus front. Given all these demands, the army in Gaza was only about 35,000 strong, led by the Ottoman General [[Kustafa]] and concentrated in three main defensive locations: Gaza, Tel Es Sheria, and Beersheba. Allenby's army was now much larger, with some 88,000 troops in good condition and well-equipped.
399 |
400 | ===Battle of El Buggar Ridge===
401 | {{Main|Battle of El Buggar Ridge}}
402 | The occupation of Karm by the Allies on 22 October, 1917 created a major point for supply and water for the troops in the immediate area. For the Ottoman forces, the establishment of a railway station at Karm placed the defensive positions known as the Hureira Redoubt and Rushdie System, which formed a powerful bulwark against any Allied action, under threat.
403 |
404 | To forestall this threat, General Erich von Falkenhayn, the Commander of the Yildirim Group, proposed a two phase attack. The plan called for a reconnaissance in force from Beersheba on 27 October, to be followed by an all out attack launched by the 8th Army from Hureira. This second phase was ironically scheduled to occur on the morning of 31 October, 1917, the day when the Battle of Beersheba began.
405 |
406 | ===Battle of Beersheba===
407 | {{Main|Battle of Beersheba (1917)}}
408 | A key feature of the British plan was to convince the Turks (and their German leaders) that once again, Gaza was to be attacked. This deception campaign was extremely thorough and convincing. The [[Battle of El Buggar Ridge]], initiated by the Turks, completed the deception. When the Allies launched their attack on Beersheba, the Turks were taken by surprise. In one of the most remarkable feats of planning and execution, the Allies were able to move some 40,000 men and a similar number of horses over hostile and inhospitable terrain without being detected by the Turks. The climax of the battle was one of the last successful cavalry charges of modern warfare, when two Australian Light Horse regiments (4th and 12th) charged across open ground just before dusk and captured the town.
409 |
410 | The Turkish defeat at Beersheba on 31 October was not a complete rout. The Turks retreated into the hills and prepared defensive positions to the north of Beersheba. For the Allies, the following days were spent fighting a difficult and bloody battle at Tel el Khuweilifeh, to the north east of Beersheba.
411 |
412 | [[Image:Palestine-WW1-2.jpg|thumb|300px|right|Allenby's Offensive, November-December 1917]]
413 | To break through the Turkish defensive line, the Allied forces attacked the Ottoman positions at Tel Es Sheria on 6 November, and followed this up with a further attack at Huj the following day. With the imminent collapse of Gaza at the same time, the Turks quickly retreated to a new line of defence.
414 |
415 | ===Third Battle of Gaza===
416 | {{Main|Third Battle of Gaza}}
417 | On 7 November, the British attacked Gaza for the third time. The Turks, worried about being cut off, retreated in the face of the British assault. Gaza had finally been captured.
418 |
419 | The Turkish defensive position was shattered, the Ottoman army was retreating in some disarray, and General Allenby ordered his army to pursue the enemy. The British followed closely on the heels of the retreating Ottoman forces. An attempt by the Turks to form a defence of a place called Junction Station (Wadi Sarar) was foiled by a British attack on 13 November. General Falkenhayn next tried to form a new defensive line from [[Bethlehem]] to Jerusalem to [[Jaffa]]. The first British attack on Jerusalem failed but with a short rest and the gathering of more infantry divisions, Allenby tried again and on 9 December, 1917, Jerusalem was captured. This was a major political event for the British government of David Lloyd George, one of the few real successes the British could point to after three long bloody years of war.
420 |
421 | On the Turkish side, this defeat marked the exit of Djemal Pasha, who returned to [[Istanbul]]. Djemal had delegated the actual command of his army to German officers such as von Kressenstein and von Falkenhayn more than a year earlier, but now, defeated as [[Enver Pasha]] had been at the [[Battle of Sarikamis]], he gave up even nominal command and returned to the capital. Less than a year remained before he was forced out of the government. General Falkenhayn was also replaced, in March 1918.
422 |
423 | == The final year: Palestine and Syria ==
424 | [[Image:Palestine-WW1-3.jpg|thumb|230px|left|Allenby's Final Attack, September 1918]]
425 | The British government had hopes that the Ottoman Empire could be defeated early in the coming year with successful campaigns in Palestine and Mesopotamia but the [[Spring Offensive]] by the Germans on the Western Front delayed the expected attack on Syria for nine full months. General Allenby's army was largely redeployed to France and most of his divisions were rebuilt with units recently recruited in India. His forces spent much of the summer of 1918 training and reorganising.
426 |
427 | Because the British achieved complete control of the air with their new [[Sopwith Camel|fighter planes]], the Turks, and their new German commander, General [[Otto Liman von Sanders|Liman von Sanders]], had no clear idea where the British were going to attack. Compounding the problems, the Turks, at the direction of their [[War Minister]] [[Enver Pasha]] withdrew their best troops during the summer for the creation of Enver's [[Ottoman Army of Islam|Army of Islam]], leaving behind poor quality, dispirited soldiers. During this time, the Turks were distracted by raids against their open desert (eastern) flank by forces of the Arab Revolt commanded by the [[Faisal I of Iraq|Emir Feisal]] and coordinated by [[T. E. Lawrence]] and other British liaison officers, which tied down thousands of soldiers in garrisons throughout Palestine, [[Jordan]], and Syria.
428 |
429 | ===Battle of Megiddo===
430 | {{Main|Battle of Megiddo (1918)}}
431 | General Allenby finally launched his long-delayed attack on 19 September, 1918. The campaign has been called the Battle of Megiddo (which is a transliteration of the Hebrew name of an ancient town known in the west as [[Armageddon]]). Again, the British made major efforts to deceive the Turks as to their actual intended target of operations. This effort was, again, successful and the Turks were taken by surprise when the British attacked Megiddo in a sudden storm. The Turkish troops started a full scale retreat, the British bombed the fleeing columns of men from the air and within a week, the Turkish army in Palestine had ceased to exist as a military force.
432 |
433 | The ultimate goal of Allenby's and Feisal's armies was [[Damascus]]. Two separate Allied columns marched towards Damascus. The first, composed mainly of Australian and Indian cavalry, approached from Galilee, while the other column, consisting of Indian cavalry and the ''ad hoc'' militia following T.E. Lawrence, travelled northwards along the [[Hejaz Railway]]. Australian Light Horse troops marched unopposed into Damascus on 1 October, 1918, despite the presence of some 12,000 Turkish soldiers at Baramke Barracks. Major Olden of the Australian 10th Light Horse Regiment received the Official Surrender of the City at 7 am at the Serai. Later that day, Lawrence's irregulars entered Damascus to claim full credit for its capture.
434 |
435 | The war in Palestine was over, but in Syria it lasted for a further month. The Turkish government was quite prepared to sacrifice these non-Turkish provinces without surrendering. Indeed, while this battle was raging, the Turks sent an expeditionary force into Russia to enlarge the ethnic Turkish elements of the empire. It was only after the surrender of Bulgaria, which put Turkey into a vulnerable position for invasion, that the Turkish government was compelled to sign an armistice on 30 October, 1918, and surrendered outright two days later. Six hundred years of Ottoman rule over the [[Middle East]] had come to an end.
436 |
437 | == In popular media ==
438 | This campaign has been depicted in several films. The most famous is ''[[Lawrence of Arabia (film)|Lawrence of Arabia]]'' (1962), though it focused primarily on T.E. Lawrence and the Arab Revolt. Other films dealing with this topic include ''[[Forty Thousand Horsemen]]'' (1941), and ''[[The Lighthorsemen (film)|The Lighthorsemen]]'' (1987), with [[Peter Phelps]] and [[Nick Waters]], both of which focused on the role of the ANZAC forces during the campaign.
439 |
440 | ==Summary==
441 | The British suffered a total of 550,000 casualties: more than 90% of these were not battle losses but instead attributable to disease, heat and other secondary causes. Total Turkish losses are unknown but almost certainly larger: an entire army was lost in the fighting and the Turks poured a vast number of troops into the front over the three years of combat.
442 |
443 | Despite the uncertainty of casualty counts, the historical consequences of this campaign are hard to overestimate. The British conquest of Palestine led directly to the [[British mandate]] over Palestine and the [[Trans-Jordan]] which, in turn, paved the way for the creation of the states of [[Israel]], [[Jordan]], [[Lebanon]], and [[Syria]].
444 |
445 | ==References==
446 | {{Reflist}}
447 |
448 | ==See also==
449 | {{portal|World War I}}
450 | *[[Bund der Asienkämpfer]]
451 | {{Commonscat-inline|Sinai and Palestine Campaign}}
452 |
453 | ==External links==
454 | * First World War.com. [http://www.firstworldwar.com/battles/suez.htm Defence of the Suez Canal, 1915]. Retrieved 19 December, 2005.
455 | * [http://alh-research.tripod.com/Light_Horse/ Australian Light Horse Studies Centre]
456 | * [http://www.turkeyswar.com/campaigns/palestine1.htm Palestine pages of 'Turkey in WW1' web site]
457 | * [http://www.nzhistory.net.nz/node/13507 Sinai campaign (NZHistory.net.nz)]
458 | * [http://www.nzhistory.net.nz/node/14256 Palestine campaign (NZHistory.net.nz)]
459 | * [http://www.ottomanpalestine.com/GALLERY_1.htm The Photographs of Palestine Campaign]
460 | * [http://hdl.loc.gov/loc.pnp/ppmsca.13709 Library of Congress's American Colony in Jerusalem's Photo Album]
461 |
462 | ==Sources==
463 | * Battles Nomenclature Committee, Army. ''The Official Names of the Battles and Other Engagements Fought by the Military Forces of the British Empire during the Great War, 1914-1919, and the third Afghan War, 1919: Report of the Battles Nomenclature Committee as Approved by The Army Council Presented to Parliament by Command of His Majesty'' (London, 1922).
464 | * Jean Bou, ''A History of Australia's Mounted Arm'' Series: Australian Army History Series (Port Melbourne: Cambridge University Press, 2009).
465 | * Bruce, Anthony (2002). ''The Last Crusade: The Palestinian Campaign in the First World War''. John Murray.
466 | * Field Marshal Lord Carver, ''The National Army Museum Book of The Turkish Front 1914-1918 The Campaigns at Gallipoli, in Mesopotamia and in Palestine'' (London: Pan Macmillan, 2003).
467 | * R. M. Downes, ''The Campaign in Sinai and Palestine'' Part II in Volume 1 ''Gallipoli, Palestine and New Guinea'' of A. G. Butler, ''Official History of the Australian Army Medical Services, 1914–1918'' (2nd edition 1938) p. 553. On line at Australian War Memorial; Official Histories.
468 | * Erickson, Edward J., ''Ordered to Die: A History of the Ottoman Army in the First World War'', Foreword by General Hüseyin Kivrikoglu, Contributions in Military Studies, No. 201 (Westport, Connecticut: Greenwood Press, 2001).
469 | * Esposito, Vincent (ed.) (1959). ''The West Point Atlas of American Wars - Vol. 2''. Frederick Praeger Press.
470 | * Fromkin, David (1989). ''A Peace to End All Peace''. Avon Books.
471 | * Grainger, John D. (2006) ''The Battle for Palestine: 1917'' Boydell Press. ISBN 1 84383 263 1
472 | * Keegan, John (1998). ''The First World War''. Random House Press.
473 | * E.G. Keogh, ''Suez to Aleppo'' (Melbourne: Directorate of Military Training, 1955).
474 | * Preston, Lieutenant-Colonel Richard Martin (1921) ''The Desert Mounted Corps: An Account of the Cavalry Operations in Palestine and Syria 1914 to 1918''. Houghton Mifflin Company. [http://books.google.com/books?id=LHg5xNCFDGsC Google Books Search]
475 | * Powles, C. Guy, ''The New Zealanders in Sinai and Palestine'' Volume III ''Official History New Zealand's Effort in the Great War'' (Auckland, Christchurch, Dunedin and Wellington: Whitcombe & Tombs Ltd, 1922).
476 | * War Diaries of 1st, 2nd and 3rd Light Horse Brigades. [available on the Australian War Memorial's web site]
477 | * Field Marshal Earl Wavell, ''The Palestine Campaigns'' 3rd Edition thirteenth Printing; Series: A Short History of the British Army 4th Edition by Major E.W. Sheppard (London: Constable & Co. 1968).
478 | * Woodward, David R (2006). ''Forgotten Soldiers of the First World War - Lost Voices from the Middle Eastern Front''. Tempus Publishing.
479 |
480 | {{World War I}}
481 |
482 | [[Category:Ottoman Empire and World War I]]
483 | [[Category:Middle Eastern theatre of World War I| ]]
484 | [[Category:Campaigns and theatres of World War I|Sinai and Palestine]]
485 | [[Category:Military campaigns and theatres of World War I involving Australia]]
486 |
487 | [[es:Campaña del Sinaí y Palestina]]
488 | [[he:המערכה על סיני וארץ ישראל במלחמת העולם הראשונה]]
489 | [[hu:Palesztin front (első világháború)]]
490 | [[pt:Campanha do Sinai e Palestina]]
491 | [[ru:Синайско-Палестинская кампания]]
492 | [[sr:Синајски и палестински поход]]
493 | [[tr:Sina ve Filistin Cephesi]]
494 |
495 |
496 |
--------------------------------------------------------------------------------
/python/revision_differ.py:
--------------------------------------------------------------------------------
1 | #!/usr/local/bin/pypy
2 | ################################################################################
3 | # Revision Differ
4 | #
5 | # This script was written to be a streaming mapper for wikihadoop
6 | # (see https://github.com/whym/wikihadoop). By default, this script runs under
7 | # pypy (much faster), but it can also be run under CPython 2.7+.
8 | #
9 | # Required to run this script are
10 | # - diff_match_patch.py (provided)
11 | # - xml_simulator.py (provided)
12 | # - wikimedia-utilities (https://bitbucket.org/halfak/wikimedia-utilities)
13 | #
14 | # Author: Aaron Halfaker (aaron.halfaker@gmail.com)
15 | #
16 | # This software is licensed as GPLv2 (http://www.gnu.org/licenses/gpl-2.0.html) and
17 | # is provided WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
18 | # implied.
19 | #
20 | ################################################################################
21 | import logging, traceback, sys, re
22 | from StringIO import StringIO
23 |
24 | from diff_match_patch import diff_match_patch
25 |
26 | from xml_simulator import RecordingFileWrapper
27 | from wmf.dump.iterator import Iterator
28 | import wmf
29 |
30 | def tokenize(content):
31 |     return re.findall(
32 |         r"[\w]+" + #Word
33 |         r"|\[\[" + #Opening internal link
34 |         r"|\]\]" + #Closing internal link
35 |         r"|\{\{\{" + #Opening template var (listed before "{{" so it can match)
36 |         r"|\}\}\}" + #Closing template var (listed before "}}" so it can match)
37 |         r"|\{\{" + #Opening template
38 |         r"|\}\}" + #Closing template
39 |         r"|\n+" + #Line breaks
40 |         r"| +" + #Spaces
41 |         r"|&\w+;" + #HTML escape sequence
42 |         r"|'''" + #Bold
43 |         r"|''" + #Italics
44 |         r"|=+" + #Header
45 |         r"|\{\|" + #Opening table
46 |         r"|\|\}" + #Closing table
47 |         r"|\|\-" + #Table row
48 |         r"|.", #Misc character
49 |         content
50 |     )
51 |
52 | def hashTokens(tokens, hash2Token=[], token2Hash={}):
53 |     hashBuffer = StringIO()
54 |     for t in tokens:
55 |         if t in token2Hash:
56 |             hashBuffer.write(unichr(token2Hash[t]+1))
57 |         else:
58 |             hashId = len(hash2Token)
59 |             hash2Token.append(t)
60 |             token2Hash[t] = hashId
61 |             hashBuffer.write(unichr(hashId+1))
62 |
63 |     return (hashBuffer.getvalue(), hash2Token, token2Hash)
64 |
65 | def unhash(hashes, hash2Token, sep=''):
66 |     return sep.join(hash2Token[ord(h)-1] for h in hashes)
67 |
68 | def simpleDiff(content1, content2, tokenize=tokenize, sep='', report=[-1,0,1]):
69 |     hashes1, h2t, t2h = hashTokens(tokenize(content1), [], {}) #Fresh tables per diff; the mutable defaults would be shared across calls
70 |     hashes2, h2t, t2h = hashTokens(tokenize(content2), h2t, t2h)
71 |
72 |     report = set(report)
73 |
74 |     dmp = diff_match_patch()
75 |
76 |     diffs = dmp.diff_main(hashes1, hashes2, checklines=False)
77 |
78 |     position = 0
79 |     for (ar,hashes) in diffs:
80 |         content = unhash(hashes,h2t,sep=sep)
81 |         if ar in report:
82 |             yield position, ar, content
83 |
84 |         if ar != -1: position += len(content)
85 |
86 |
87 | metaXML = """
88 |
89 |
90 | Wikipedia
91 | http://en.wikipedia.org/wiki/Main_Page
92 | MediaWiki 1.17wmf1
93 | first-letter
94 |
95 | Media
96 | Special
97 |
98 | Talk
99 | User
100 | User talk
101 | Wikipedia
102 | Wikipedia talk
103 | File
104 | File talk
105 | MediaWiki
106 | MediaWiki talk
107 | Template
108 | Template talk
109 | Help
110 | Help talk
111 | Category
112 | Category talk
113 | Portal
114 | Portal talk
115 | Book
116 | Book talk
117 |
118 |
119 | """
120 |
121 |
122 | xmlSim = RecordingFileWrapper(sys.stdin, pre=metaXML, post='')
123 |
124 | try:
125 |     dump = Iterator(xmlSim)
126 | except Exception as e:
127 |     sys.stderr.write(str(e) + xmlSim.getHistory())
128 |     sys.exit(1)
129 |
130 |
131 | for page in dump.readPages():
132 |     sys.stderr.write('Processing: %s - %s\n' % (page.getId(), page.getTitle().encode('UTF-8')))
133 |     try:
134 |         lastRev = None
135 |         currRevId = None
136 |         for revision in page.readRevisions():
137 |             currRevId = revision.getId()
138 |             if lastRev is None:
139 |                 lastRev = revision
140 |             else:
141 |                 namespace, title = wmf.normalizeTitle(page.getTitle(), namespaces=dump.namespaces)
142 |                 nsId = dump.namespaces[namespace]
143 |                 if revision.getContributor() is not None:
144 |                     userId = revision.getContributor().getId()
145 |                     userName = revision.getContributor().getUsername()
146 |                 else:
147 |                     userId = None
148 |                     userName = None
149 |
150 |                 row = [
151 |                     repr(revision.getId()),
152 |                     repr(page.getId()),
153 |                     repr(nsId),
154 |                     repr(title),
155 |                     repr(revision.getTimestamp()),
156 |                     repr(revision.getComment()),
157 |                     repr(revision.getMinor()),
158 |                     repr(userId),
159 |                     repr(userName)
160 |                 ]
161 |                 try:
162 |                     for d in simpleDiff(lastRev.getText(), revision.getText(), report=[-1,1]):
163 |                         row.append(":".join(repr(v) for v in d))
164 |
165 |                     print("\t".join(row))
166 |                     sys.stderr.write('reporter:counter:SkippingTaskCounters,MapProcessedRecords,1\n')
167 |                 except Exception as e:
168 |                     row.extend(["diff_fail", str(e).encode('string-escape')])
169 |                     print("\t".join(row))
170 |                     raise #Bare raise keeps the original traceback
171 |
172 |
173 |     except Exception as e:
174 |         sys.stderr.write('%s - while processing revId=%s\n' % (e, currRevId))
175 |         traceback.print_exc(file=sys.stderr)
176 |
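The token-level diff performed by revision_differ.py can be illustrated with a small standalone sketch. This is not the script's implementation: it assumes Python 3, uses a reduced token pattern, and swaps in the standard library's difflib.SequenceMatcher for diff_match_patch, so the exact edit boundaries may differ from what the mapper emits. The names TOKEN_RE and token_diff are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Reduced token pattern (assumption: a subset of the script's tokenizer).
# Longer delimiters come first so "{{{" is not split into "{{" and "{".
TOKEN_RE = re.compile(r"\w+|\[\[|\]\]|\{\{\{|\}\}\}|\{\{|\}\}|\n+| +|.")


def tokenize(text):
    """Split wikitext into words, markup delimiters, whitespace runs and single characters."""
    return TOKEN_RE.findall(text)


def token_diff(old, new):
    """Yield (position, op, text) for removed (-1) and inserted (1) token runs.

    position is the number of characters of `new` that precede the change.
    """
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    position = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = "".join(a[i1:i2])
        kept_or_added = "".join(b[j1:j2])
        if tag in ("delete", "replace"):
            yield position, -1, removed
        if tag in ("insert", "replace"):
            yield position, 1, kept_or_added
        # Only text present in the new revision advances the position.
        position += len(kept_or_added)


changes = list(token_diff("Hello [[world]]", "Hello [[big world]]"))
```

Diffing token sequences rather than characters keeps markup such as "[[" intact in the reported changes, which is the same motivation behind hashing tokens to single characters before calling diff_match_patch.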
--------------------------------------------------------------------------------
/python/xml_simulator.py:
--------------------------------------------------------------------------------
1 | import sys
2 | from StringIO import StringIO
3 | from collections import deque
4 |
5 | class FileWrapper:
6 |
7 |     def __init__(self, fp, pre='', post=''):
8 |         self.fp = fp
9 |         self.pre = StringIO(pre)
10 |         self.post = StringIO(post)
11 |         self.closed = False
12 |         self.mode = "r"
13 |
14 |     def read(self, bytes=sys.maxint):
15 |         bytes = int(bytes)
16 |         if self.closed: raise ValueError("I/O operation on closed file")
17 |
18 |         preBytes = self.pre.read(bytes)
19 |         if len(preBytes) < bytes:
20 |             fpBytes = self.fp.read(bytes-len(preBytes))
21 |         else:
22 |             fpBytes = ''
23 |
24 |         if len(preBytes) + len(fpBytes) < bytes:
25 |             postBytes = self.post.read(bytes-(len(preBytes) + len(fpBytes)))
26 |         else:
27 |             postBytes = ''
28 |
29 |         return preBytes + fpBytes + postBytes
30 |
31 |     def readline(self):
32 |         if self.closed: raise ValueError("I/O operation on closed file")
33 |
34 |         output = self.pre.readline()
35 |         if len(output) == 0 or output[-1] != "\n":
36 |             output += self.fp.readline()
37 |         if len(output) == 0 or output[-1] != "\n":
38 |             output += self.post.readline()
39 |
40 |         return output
41 |
42 |     def readlines(self): raise NotImplementedError()
43 |
44 |     def __iter__(self):
45 |
46 |         line = self.readline()
47 |         while line != '':
48 |             yield line
49 |             line = self.readline()
50 |
51 |
52 |     def seek(self): raise NotImplementedError()
53 |     def write(self): raise NotImplementedError()
54 |     def writelines(self): raise NotImplementedError()
55 |     def tell(self):
56 |         return self.pre.tell() + self.fp.tell() + self.post.tell()
57 |
58 |
59 |     def close(self):
60 |         self.closed = True
61 |         self.fp.close()
62 |
63 | class RecordingFileWrapper(FileWrapper):
64 |
65 |     def __init__(self, fp, pre='', post='', record=10000):
66 |         self.history = deque(maxlen=record)
67 |         FileWrapper.__init__(self, fp, pre=pre, post=post)
68 |
69 |     def read(self, bytes=sys.maxint):
70 |         outBytes = FileWrapper.read(self, bytes)
71 |         self.history.extend(outBytes)
72 |         return outBytes
73 |
74 |     def readline(self):
75 |         outBytes = FileWrapper.readline(self)
76 |         self.history.extend(outBytes)
77 |         return outBytes
78 |
79 |     def getHistory(self):
80 |         return ''.join(self.history)
81 |
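The purpose of FileWrapper is to surround a stream of page fragments with a header and footer so that an XML parser sees one well-formed document. A minimal Python 3 sketch of the same idea follows. It is an illustration under assumptions, not a port of the class: WrappedReader is a hypothetical name, only read() is implemented, and the bare <mediawiki> root element stands in for the fuller header that revision_differ.py passes as metaXML.

```python
import io
import xml.etree.ElementTree as ET


class WrappedReader:
    """File-like reader that yields `pre`, then the wrapped stream, then `post`."""

    def __init__(self, fp, pre="", post=""):
        # The three parts are drained in order by read().
        self._parts = [io.StringIO(pre), fp, io.StringIO(post)]

    def read(self, size=-1):
        if size is None or size < 0:
            return "".join(part.read() for part in self._parts)
        chunks = []
        remaining = size
        for part in self._parts:
            if remaining <= 0:
                break
            chunk = part.read(remaining)  # returns "" once a part is exhausted
            chunks.append(chunk)
            remaining -= len(chunk)
        return "".join(chunks)


# A fragment without a root element, as a mapper might receive it on stdin.
fragment = io.StringIO("<page><title>A</title></page>\n")
reader = WrappedReader(fragment, pre="<mediawiki>\n", post="</mediawiki>\n")
root = ET.parse(reader).getroot()
titles = [page.findtext("title") for page in root.iter("page")]

# Reads may span the boundary between parts.
small = WrappedReader(io.StringIO("bc"), pre="a", post="d")
pieces = [small.read(3), small.read(3), small.read(3)]
```

Because the wrapper only concatenates streams lazily, the fragment is never copied into memory as a whole, which matters when a single page history is very large.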
--------------------------------------------------------------------------------
/src/main/java/org/wikimedia/wikihadoop/ByteMatcher.java:
--------------------------------------------------------------------------------
1 | /**
2 | * Copyright 2011 Yusuke Matsubara
3 | *
4 | * Licensed under the Apache License, Version 2.0 (the "License");
5 | * you may not use this file except in compliance with the License.
6 | * You may obtain a copy of the License at
7 | *
8 | * http://www.apache.org/licenses/LICENSE-2.0
9 | *
10 | * Unless required by applicable law or agreed to in writing, software
11 | * distributed under the License is distributed on an "AS IS" BASIS,
12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | * See the License for the specific language governing permissions and
14 | * limitations under the License.
15 | */
16 |
17 | package org.wikimedia.wikihadoop;
18 |
19 | import java.io.*;
20 |
21 | import org.apache.hadoop.io.DataOutputBuffer;
22 | import org.apache.hadoop.fs.Seekable;
23 |
24 | public class ByteMatcher {
25 |   private final InputStream in;
26 |   private final Seekable pos;
27 |   private long lastPos;
28 |   private long currentPos;
29 |   private long bytes;
30 |   public ByteMatcher(InputStream in, Seekable pos) throws IOException {
31 |     this.in = in;
32 |     this.pos = pos;
33 |     this.bytes = 0;
34 |     this.lastPos = -1;
35 |     this.currentPos = -1;
36 |   }
37 |   public ByteMatcher(SeekableInputStream is) throws IOException {
38 |     this(is, is);
39 |   }
40 |   public long getReadBytes() {
41 |     return this.bytes;
42 |   }
43 |   public long getPos() throws IOException {
44 |     return this.pos.getPos();
45 |   }
46 |   public long getLastUnmatchPos() { return this.lastPos; }
47 |
48 |   public void skip(long len) throws IOException {
49 |     // InputStream.skip may skip fewer bytes than requested; count only what was actually skipped
50 |     this.bytes += this.in.skip(len);
51 |   }
52 |
53 |   boolean readUntilMatch(String textPat, DataOutputBuffer outBufOrNull, long end) throws IOException {
54 |     byte[] match = textPat.getBytes("UTF-8");
55 |     int i = 0;
56 |     while (true) {
57 |       int b = this.in.read();
58 |       // end of file:
59 |       if (b == -1) {
60 |         System.err.println("eof 1");
61 |         return false;
62 |       }
63 |       ++this.bytes; //! TODO: count up later in batch
64 |       // save to buffer:
65 |       if (outBufOrNull != null)
66 |         outBufOrNull.write(b);
67 |
68 |       // check if we're matching (mask the signed byte so non-ASCII pattern bytes compare correctly):
69 |       if (b == (match[i] & 0xff)) {
70 |         i++;
71 |         if (i >= match.length)
72 |           return true;
73 |       } else {
74 |         i = (b == (match[0] & 0xff)) ? 1 : 0; // restart on the current byte; enough for patterns whose first byte does not recur, e.g. "<page>"
75 |         if ( this.currentPos != this.getPos() ) {
76 |           this.lastPos = this.currentPos;
77 |           this.currentPos = this.getPos();
78 |         }
79 |       }
80 |       // see if we've passed the stop point:
81 |       if (i == 0 && this.pos.getPos() >= end) {
82 |         System.err.println("eof 2: end=" + end);
83 |         return false;
84 |       }
85 |     }
86 |   }
87 | }
88 |
--------------------------------------------------------------------------------
/src/main/java/org/wikimedia/wikihadoop/SeekableInputStream.java:
--------------------------------------------------------------------------------
1 | /**
2 | * Copyright 2011 Yusuke Matsubara
3 | *
4 | * Licensed under the Apache License, Version 2.0 (the "License");
5 | * you may not use this file except in compliance with the License.
6 | * You may obtain a copy of the License at
7 | *
8 | * http://www.apache.org/licenses/LICENSE-2.0
9 | *
10 | * Unless required by applicable law or agreed to in writing, software
11 | * distributed under the License is distributed on an "AS IS" BASIS,
12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | * See the License for the specific language governing permissions and
14 | * limitations under the License.
15 | */
16 |
17 | package org.wikimedia.wikihadoop;
18 |
19 | import java.io.*;
20 |
21 | import org.apache.hadoop.fs.FileSystem;
22 | import org.apache.hadoop.fs.Path;
23 | import org.apache.hadoop.fs.FSDataInputStream;
24 | import org.apache.hadoop.fs.Seekable;
25 | import org.apache.hadoop.mapred.FileSplit;
26 | import org.apache.hadoop.mapred.*;
27 | import org.apache.hadoop.io.compress.*;
28 |
29 | public class SeekableInputStream extends FilterInputStream implements Seekable {
30 |   private final Seekable seek;
31 |   private final SplitCompressionInputStream sin;
32 |   public SeekableInputStream(FSDataInputStream in) {
33 |     super(in);
34 |     this.seek = in;
35 |     this.sin = null;
36 |   }
37 |   public SeekableInputStream(SplitCompressionInputStream cin) {
38 |     super(cin);
39 |     this.seek = cin;
40 |     this.sin = cin;
41 |   }
42 |   public SeekableInputStream(CompressionInputStream cin, FSDataInputStream in) {
43 |     super(cin);
44 |     this.seek = in;
45 |     this.sin = null;
46 |   }
47 |   public static SeekableInputStream getInstance(Path path, long start, long end, FileSystem fs, CompressionCodecFactory compressionCodecs) throws IOException {
48 |     CompressionCodec codec = compressionCodecs.getCodec(path);
49 |     FSDataInputStream din = fs.open(path);
50 |     if (codec != null) {
51 |       Decompressor decompressor = CodecPool.getDecompressor(codec);
52 |       if (codec instanceof SplittableCompressionCodec) {
53 |         SplittableCompressionCodec scodec = (SplittableCompressionCodec)codec;
54 |         SplitCompressionInputStream cin = scodec.createInputStream
55 |           (din, decompressor, start, end,
56 |            SplittableCompressionCodec.READ_MODE.BYBLOCK);
57 |         return new SeekableInputStream(cin);
58 |       } else {
59 |         // non-splittable compression input stream
60 |         // no seeking or offsetting is needed
61 |         assert start == 0;
62 |         CompressionInputStream cin = codec.createInputStream(din, decompressor);
63 |         return new SeekableInputStream(cin, din);
64 |       }
65 |     } else {
66 |       // non compression input stream
67 |       // we seek to the start of the split
68 |       din.seek(start);
69 |       return new SeekableInputStream(din);
70 |     }
71 |   }
72 |   public static SeekableInputStream getInstance(FileSplit split, FileSystem fs, CompressionCodecFactory compressionCodecs) throws IOException {
73 |     return getInstance(split.getPath(), split.getStart(), split.getStart() + split.getLength(), fs, compressionCodecs);
74 |   }
75 |   public SplitCompressionInputStream getSplitCompressionInputStream() { return this.sin; }
76 |   public long getPos() throws IOException { return this.seek.getPos(); }
77 |   public void seek(long pos) throws IOException { this.seek.seek(pos); }
78 |   public boolean seekToNewSource(long targetPos) throws IOException { return this.seek.seekToNewSource(targetPos); }
79 |   @Override public String toString() {
80 |     return this.in.toString();
81 |   }
82 | }
83 |
--------------------------------------------------------------------------------
/src/main/java/org/wikimedia/wikihadoop/StreamWikiDumpInputFormat.java:
--------------------------------------------------------------------------------
1 | /**
2 | * Copyright 2011 Yusuke Matsubara
3 | *
4 | * Licensed under the Apache License, Version 2.0 (the "License");
5 | * you may not use this file except in compliance with the License.
6 | * You may obtain a copy of the License at
7 | *
8 | * http://www.apache.org/licenses/LICENSE-2.0
9 | *
10 | * Unless required by applicable law or agreed to in writing, software
11 | * distributed under the License is distributed on an "AS IS" BASIS,
12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | * See the License for the specific language governing permissions and
14 | * limitations under the License.
15 | */
16 |
17 | package org.wikimedia.wikihadoop;
18 |
19 | import java.io.*;
20 | import java.util.*;
21 |
22 | import org.apache.hadoop.io.DataOutputBuffer;
23 | import org.apache.hadoop.io.Writable;
24 | import org.apache.hadoop.io.Text;
25 | import org.apache.hadoop.io.WritableComparable;
26 | import org.apache.hadoop.fs.FileSystem;
27 | import org.apache.hadoop.fs.Path;
28 | import org.apache.hadoop.fs.FSDataInputStream;
29 | import org.apache.hadoop.io.Text;
30 | import org.apache.hadoop.fs.Seekable;
31 | import org.apache.hadoop.fs.FileStatus;
32 | import org.apache.hadoop.fs.BlockLocation;
33 | import org.apache.hadoop.net.NetworkTopology;
34 | import org.apache.hadoop.mapred.Reporter;
35 | import org.apache.hadoop.mapred.FileSplit;
36 | import org.apache.hadoop.mapred.JobConf;
37 | import org.apache.hadoop.mapred.RecordReader;
38 | import org.apache.hadoop.mapred.*;
39 | import org.apache.hadoop.io.compress.*;
40 | import java.util.regex.*;
41 |
42 | /** An InputFormat implementation that splits a Wikimedia Dump File into page fragments, and emits them as input records.
43 | * The record reader embedded in this input format converts a page into a sequence of page-like elements, each of which contains two consecutive revisions. Output is given as keys with empty values.
44 | *
45 | * For example, given the following input containing two pages and four revisions,
46 | *