299 |
300 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraper Workshop
2 | Welcome to the workshop where you are going to build a web scraper that is going to use:
3 |
4 | - Platform threads
5 | - Virtual threads
6 | - Structured task scope
7 | - Scoped values
8 |
9 | The workshop starts with a simple single-threaded web scraper that only scrapes a single page. You are going to improve this web scraper by first making it multithreaded using platform threads and later by using virtual threads. This new type of thread is a great
10 | addition to the Java language, but it doesn't behave the same as platform threads in every situation. During this workshop you are going to experience when virtual threads work best, and when they work just okay.
11 |
12 | To follow along with the workshop you need to also check out [this repository](https://github.com/davidtos/workshop_server).
13 | The repo contains a Spring application that is going to act as the web server that you are going to scrape.
14 |
15 | ## Contact info
16 | If you have any questions after the workshop, or didn't attend the workshop but still want to ask me something,
17 | you can find my contact info here: https://davidvlijmincx.com/about/
18 |
19 | I am happy to answer any Java-related question :)
20 |
21 |
22 | ## How to follow along with the workshop
23 | Below you will find the steps of the workshop. The best way to follow along is to start with step 1 and to keep developing inside this branch. If you want to start every step of the workshop with a clean branch, you can also check out the branch belonging to that step. Each step has a branch inside this Git repo with the same name. If you are falling behind, please say so, then we can adjust the speed of the workshop :-) or you can check out the branch of the next step.
24 |
25 | # TL;DR
26 | - Let's build a web scraper!
27 | - Run the Spring project inside [this repository](https://github.com/davidtos/workshop_server), it has the web server you will scrape inside it
28 | - Follow along with the steps below
29 | - Are you already done and want to start with the next step? Go ahead! :-)
30 | - Any questions? Feel free to ask! I am more than happy to answer them
31 |
32 | # Requirements
33 | To follow along with this workshop you need the following things:
34 |
35 | - Java 21 or higher
36 | - Check out and run the project in [this repository](https://github.com/davidtos/workshop_server)
37 | - Check out this repository if you haven't done so already
38 |
39 | # The steps of the workshop:
40 | Just follow along with the following steps. If you have any questions, feel free to ask; Ron and I are there to answer them.
41 | We will give you some needed information between steps, so you can focus on solving one type of problem at a time. :-)
42 |
43 | ## (Step 1) - check out the code
44 | You need to check out these two repositories:
45 |
46 | ### The scraper basis https://github.com/davidtos/virtual_thread_workshop
47 | This is the repository you are looking at right now. It contains all the steps/branches and starting information that you will need to build the scraper.
48 |
49 | ### The web server https://github.com/davidtos/workshop_server
50 | This is the web server that the scraper is going to scrape. The repository contains a Spring Boot application that you can run in the background while you build the scraper. You don't have to make any changes to this project.
51 |
52 | When you have both projects checked out and the Spring Boot application running, you can verify that everything works as it should. To do so, run the WebScraper class from this repository; it should scrape a single page.
53 |
54 | ## (Step 2) - Add platform threads
55 | Check out the branch "step-1-2-add-platform-threads" if you haven't done so already (it is the default branch). This branch is the basis of the web scraper.
56 | You can already run it, and it will scrape a single page from the web server/Spring Boot application.
57 |
58 | The first step is to make the **Scrape** class run concurrently using platform threads. The goal is to be able to create any number of
59 | Scrape instances that each scrape a single page.
60 |
61 |
62 | Hint 1
63 | One way to achieve this is by using one of the executor services from the `Executors` class.
64 |
65 |
66 |
67 | Hint 2
68 | To turn the Scrape class into something that is easily run by a thread, you can make the Scrape class
69 | implement the Runnable interface. You can then either rename the scrape method to `run` or create a new `run` method.
70 |
71 | Having done this, you can pass a new Scrape instance to a Thread.
72 |
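A minimal sketch of what these hints point at. The names here (`PlatformThreadDemo`, `ScrapeTask`, the counter) are mine; `ScrapeTask` stands in for the workshop's Scrape class, and the counter stands in for the real jsoup work:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PlatformThreadDemo {
    static final AtomicInteger pagesScraped = new AtomicInteger();

    // Stand-in for the workshop's Scrape class: anything that implements
    // Runnable can be handed to an executor (or wrapped in a Thread)
    record ScrapeTask(int id) implements Runnable {
        public void run() {
            pagesScraped.incrementAndGet(); // the real version would fetch and parse a page
        }
    }

    public static void main(String[] args) {
        // A fixed pool of platform threads; try-with-resources closes the
        // executor and waits for the submitted tasks to finish (Java 19+)
        try (var executor = Executors.newFixedThreadPool(8)) {
            for (int i = 0; i < 100; i++) {
                executor.submit(new ScrapeTask(i));
            }
        }
        System.out.println(pagesScraped.get()); // 100
    }
}
```

The same `Runnable` can also be passed straight to a `Thread` if you prefer not to use an executor.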
73 |
74 |
75 | ## (Step 3) - Start using virtual threads
76 | You can now scrape webpages using multiple Scrape instances that each run on a Platform Thread. The next step is to change it in such a way that it uses Virtual threads instead. To do this you can use the Thread class or an Executor.
77 |
78 | Before you make the change take a good look at the performance so you can make a fair comparison between Virtual threads and Platform threads.
79 |
80 | > Make it easy to switch between virtual and platform threads, so you can see the difference in performance.
81 | > It doesn't need to be anything fancy; commenting out a line of code is fine.
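One way to keep the switch cheap (a sketch; the class name and flag are mine) is to choose the executor in a single place:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadSwitchDemo {
    static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) {
        // Flip this flag (or comment out a line) to compare the two thread types
        boolean useVirtual = true;

        ExecutorService executor = useVirtual
                ? Executors.newVirtualThreadPerTaskExecutor() // one new virtual thread per task
                : Executors.newFixedThreadPool(16);           // pool of 16 platform threads

        // try-with-resources closes the executor and waits for the tasks (Java 19+)
        try (executor) {
            for (int i = 0; i < 1_000; i++) {
                executor.submit(() -> {
                    counter.incrementAndGet(); // stand-in for scraping one page
                });
            }
        }
        System.out.println(counter.get()); // 1000
    }
}
```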
82 |
83 | ## (Step 4) - Difference between virtual and platform threads
84 | The scraper is now able to run on either Virtual Threads or Platform threads. To see the impact these Threads have on the Scraper
85 | you can play with the following two variables:
86 |
87 | 1. The URL of the scraper
88 | 2. The number of threads/tasks you create.
89 |
90 | The URLs you can use are:
91 | - http://localhost:8080/v1/crawl/delay/330/57
92 | - http://localhost:8080/v1/crawl/330/57
93 |
94 | The first endpoint has a delay between 10 and 200 milliseconds, forcing the threads to block and wait for several milliseconds.
95 | The URL without the delay returns immediately without the extra waiting,
96 | meaning that the responses from this endpoint are very quick and the thread is not blocked for very long.
97 |
98 | The second thing you can change is the number of scrape tasks you start. Try out the difference it makes when you submit
99 | 200, 400, or even 1000+ scrape tasks to a pool of platform threads, or, in the case of virtual threads, create as many virtual threads as you have jobs.
100 |
101 | Some of the results will surprise you :-)
102 |
103 | > Note: To get a good idea of the impact, I recommend trying out lots of tasks with the delay endpoint on both virtual and platform threads, and then doing the same with the endpoint without the delay.
104 |
105 | > **_!!Bonus!!_** Let's enable virtual threads on the (Spring) back-end; Spring now supports virtual threads. Let's enable them
106 | > to see how the scraper is impacted and what it does with the back-end application.
107 |
108 | ## (Step 5) - Find the pinned virtual thread
109 | Virtual threads are unmounted when they block, for example when they are waiting on the response of a web server. Unmounting is a powerful feature, but it doesn't always work (yet)...
110 | When the unmounting doesn't happen, we say that the virtual thread is pinned.
111 | A pinned virtual thread causes not only the virtual thread but also the carrier thread it's running on to be blocked. As you may expect, this causes performance issues.
112 |
113 | Now it's up to you to fix the scraper and replace the functionality that causes virtual threads to be pinned.
114 |
115 | To help you find the method causing issues you can use one of the following VM options:
116 | ```text
117 | -Djdk.tracePinnedThreads=short
118 |
119 | -Djdk.tracePinnedThreads=full
120 | ```
121 | Run the web scraper with one of these two options and replace the functionality with one that does not cause the virtual threads
122 | to be pinned. Try both options out and see what the difference between them is and which one helps you the most to fix the issue.
123 |
124 |
125 | Hint 1
126 | Java 11 added a new HTTP client (`java.net.http.HttpClient`) that does not pin virtual threads
127 |
128 |
129 |
130 |
131 | Hint 2
132 | If you want to use the new HTTP client, you can create one using the following example. It creates a basic client
133 | that follows redirects and has a connection timeout of 20 seconds.
134 |
135 | ````java
136 | private static HttpClient createHttpClient() {
137 | return HttpClient.newBuilder()
138 | .version(HttpClient.Version.HTTP_2)
139 | .followRedirects(HttpClient.Redirect.NORMAL)
140 | .connectTimeout(Duration.ofSeconds(20))
141 | .build();
142 | }
143 | ````
144 |
145 | Using the client can be done as follows. This method takes a URL, passes it to the client (here `client` is the HttpClient created above), and returns the response body.
146 | ````java
147 | private String getBody(String url) throws IOException, InterruptedException {
148 |     HttpRequest request = HttpRequest.newBuilder().GET().uri(URI.create(url)).build();
149 |     HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
150 |     return response.body();
151 | }
152 | ````
153 |
154 |
155 | ## (Step 6) - Set carrier threads (Improve performance branch)
156 | By default, you get as many carrier threads as there are cores available inside your system. There are two ways to tweak the
157 | number of carrier threads that get created.
158 |
159 | Use the following options and see what impact it has on your scraper.
160 | ```text
161 | -Djdk.virtualThreadScheduler.parallelism=1
162 |
163 | -Djdk.virtualThreadScheduler.maxPoolSize=1
164 | ```
165 |
166 | Try out some different numbers and see whether they increase or lower the number of pages per second you can scrape.
167 |
168 | > These options are not needed for the following steps.
169 |
170 | ## (Step 7) - Improve performance
171 | The next step is to improve the performance of the scraper. Make it so that the following operations run in their own virtual thread.
172 |
173 | ```java
174 | // operation 1:
175 | visited.add(url);
176 |
177 | // operation 2:
178 | for (Element link : linksOnPage) {
179 | String nextUrl = link.attr("abs:href");
180 | if (nextUrl.contains("http")) {
181 | pageQueue.add(nextUrl);
182 | }
183 | }
184 | ```
185 | Run the Scraper a few times with and without the improvement to see the difference in performance it makes.
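A sketch of the idea using `Thread.ofVirtual()`. The class name is mine, and a fixed list of links stands in for the real jsoup elements:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class SplitWorkDemo {
    static final Set<String> visited = ConcurrentHashMap.newKeySet();
    static final LinkedBlockingQueue<String> pageQueue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        String url = "http://localhost:8080/v1/crawl/330/57";
        List<String> linksOnPage = List.of("http://a", "http://b", "ftp://c");

        // Operation 1 and operation 2 each get their own virtual thread
        Thread first = Thread.ofVirtual().start(() -> visited.add(url));
        Thread second = Thread.ofVirtual().start(() -> {
            for (String nextUrl : linksOnPage) {
                if (nextUrl.contains("http")) {
                    pageQueue.add(nextUrl);
                }
            }
        });

        // Wait for both operations before moving on
        first.join();
        second.join();
        System.out.println(visited.size() + " visited, " + pageQueue.size() + " queued");
    }
}
```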
186 |
187 | ## (Step 8) - Use StructuredTaskScope
188 | > For this and the following steps it may be necessary to run your application with the `--enable-preview` flag.
189 |
190 | During the previous step, you started two virtual threads inside another virtual thread. This is a great way to run things concurrently, but it creates an implicit relationship between the threads. What should happen when a thread fails? The desired behavior in this case is all or nothing: either all threads succeed, or we roll back.
191 |
192 | During this step, we are going to improve the code to make the relationship these threads have more explicit. This helps other
193 | developers to better understand the intent of your code, and enables you to use a powerful way of managing the lifetime of threads.
194 |
195 | For this step, rewrite the code from the previous assignment so that it uses `StructuredTaskScope.ShutdownOnFailure()`. The idea is
196 | to fork new threads using the StructuredTaskScope.
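A rough sketch of the shape this can take (needs `--enable-preview` on Java 21; the stand-in methods replace the real scraping operations):

```java
import java.util.concurrent.StructuredTaskScope;

public class ScopeDemo {
    // Stand-ins for the two operations from the previous step
    static String markVisited() { return "visited"; }
    static String queueLinks()  { return "queued"; }

    public static String runBoth() throws InterruptedException {
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            var visited = scope.fork(ScopeDemo::markVisited);
            var queued  = scope.fork(ScopeDemo::queueLinks);

            scope.join();                               // wait for both subtasks
            scope.throwIfFailed(RuntimeException::new); // all or nothing: propagate the first failure

            return visited.get() + "," + queued.get();
        } // the scope guarantees both subtasks are done before we leave this block
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBoth()); // visited,queued
    }
}
```

If one subtask throws, `ShutdownOnFailure` cancels the other and `throwIfFailed` surfaces the error, which makes the relationship between the threads explicit.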
197 |
198 | ## (Step 9) - Implement ShutdownOnSuccess
199 | `ShutdownOnFailure` is not the only shutdown policy that you get with Java 21. During this step, you are going to implement the
200 | `ShutdownOnSuccess` shutdown policy. The **ShutdownOnSuccess** policy shuts down the scope as soon as one thread finishes successfully.
201 |
202 | For this step, you are going to let another service know what page you just scraped. To keep the scraper fast, it doesn't matter
203 | which instance processes the request first. The fastest instance to process the request is the winner as far as the scraper is concerned.
204 |
205 | The URLs of the instances are:
206 | - http://localhost:8080/v1/VisitedService/1
207 | - http://localhost:8080/v1/VisitedService/2
208 | - http://localhost:8080/v1/VisitedService/3
209 |
210 | The services expect a POST request with a URL as the body.
211 |
212 | Now it is up to you to implement the ShutdownOnSuccess scope in a way that a new virtual thread is forked for each one of the service instances.
213 |
214 | If you are using the HttpClient you can use the following code to do a POST request to an instance:
215 | ```java
216 | private Object post(String serviceUrl, String url) throws IOException, InterruptedException {
217 | HttpRequest request = HttpRequest.newBuilder().POST(HttpRequest.BodyPublishers.ofString(url)).uri(URI.create(serviceUrl)).build();
218 | client.send(request, HttpResponse.BodyHandlers.ofString());
219 | return null;
220 | }
221 | ```
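The race between the instances can be sketched like this (needs `--enable-preview` on Java 21; sleeps stand in for the actual HTTP POSTs, and the names are mine):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.StructuredTaskScope;

public class FirstWinsDemo {
    // Stand-in for posting to one VisitedService instance; the real version
    // would POST the scraped URL to /v1/VisitedService/{id}
    static String notifyInstance(int id, long delayMillis) throws InterruptedException {
        Thread.sleep(delayMillis);
        return "instance-" + id;
    }

    public static String firstResponder() throws InterruptedException, ExecutionException {
        try (var scope = new StructuredTaskScope.ShutdownOnSuccess<String>()) {
            scope.fork(() -> notifyInstance(1, 200));
            scope.fork(() -> notifyInstance(2, 50));  // fastest: should win
            scope.fork(() -> notifyInstance(3, 200));

            scope.join();          // returns once one subtask succeeds; the rest are cancelled
            return scope.result(); // result of the first successful subtask
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstResponder()); // instance-2
    }
}
```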
222 |
223 | ## (Step 10) - Use scoped values
224 | The name of this step already gave it away, but for the last step, you are going to add scoped values to the scraper.
225 | You need to change the Scraper in such a way that each Scraper instance runs in a scope where the HTTP client is already known.
226 |
227 | The goal is to no longer pass the HttpClient as a constructor parameter to the Scraper, but to implement it as a ScopedValue. This way the client
228 | is known inside the scraper and all the subsequent calls.
229 |
230 | > Note: During the implementation notice that the child virtual threads can use the same client as you passed to the parent thread.
231 | > When you use the structured task scope all the threads you fork will have the same scoped values as the parent because they
232 | > run in the same scope as the parent.
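A minimal sketch of binding a client to a scope (needs `--enable-preview` on Java 21; the class and method names are mine):

```java
import java.net.http.HttpClient;

public class ScopedClientDemo {
    // The client is bound to a scope instead of passed around as a constructor parameter
    static final ScopedValue<HttpClient> CLIENT = ScopedValue.newInstance();

    // Any code running inside the scope can read the bound client with CLIENT.get()
    static String describeClient() {
        return "redirect policy: " + CLIENT.get().followRedirects();
    }

    public static String runInScope() {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        try {
            // Bind CLIENT for everything executed inside this scope
            return ScopedValue.where(CLIENT, client).call(ScopedClientDemo::describeClient);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runInScope()); // redirect policy: NORMAL
    }
}
```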
233 |
234 | ## Bonus feature 1: Scope inheritance
235 | You may have used scope inheritance in step 10 already, but let's focus on it a bit more with this bonus feature. Scope inheritance
236 | is the mechanism by which a child thread inherits the scoped values from the parent thread. Essentially, both threads keep running in the same scope.
237 |
238 | To show you how inheritance of scoped values works, you will implement the `HighScore` class from the `bonus.features` package.
239 | The challenge is to pass the score to the submitScore method without using it as a method parameter. You will have to use scoped values and structured task scopes.
240 |
241 | ## Bonus feature 2: Rebinding scoped values
242 | Since we have access to the code, let's cheat a little with the high score. For this bonus feature you are going to increase the score by 1000 during the validation step.
243 | To do this you will need to rebind the Score scoped value.
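Rebinding can be sketched like this (needs `--enable-preview` on Java 21; the class and method names are mine, loosely mirroring the HighScore classes):

```java
public class RebindDemo {
    static final ScopedValue<Double> SCORE = ScopedValue.newInstance();

    // Validation step: rebinds SCORE to a boosted value for the nested scope only.
    // The outer binding is untouched once the nested scope exits.
    static double validateAndSubmit() {
        try {
            return ScopedValue.where(SCORE, SCORE.get() + 1000).call(RebindDemo::submit);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static double submit() {
        return SCORE.get(); // sees the rebound (boosted) value
    }

    public static void main(String[] args) throws Exception {
        double result = ScopedValue.where(SCORE, 500.0).call(RebindDemo::validateAndSubmit);
        System.out.println("The score is: " + result); // The score is: 1500.0
    }
}
```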
244 |
245 | ## Bonus feature 3: Creating your own shutdown task scope
246 | For this feature, you are going to implement your own task scope. Until now, you have only used the built-in task scopes/shutdown policies. Let's try something else
247 | and implement your own task scope. In the `bonus.features` package you can find the `FindBestStartingSource` class file. The function of this class is to find the best starting point URL for the scraper.
248 |
249 | The class file has two classes, `FindBestStartingSource` and `CriteriaScope`. It is up to you to implement your own structured task scope with the CriteriaScope class and use it instead of the ShutdownOnFailure scope.
250 |
251 | The goal is to implement a custom scope that stops when it has found a starting point with more than 150 URLs.
252 |
253 | > Note: To implement your own scope you need to extend the `StructuredTaskScope` class, override the `handleComplete()` method, and call `shutdown()` when you have found a good starting point for the scraper.
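The shape of such a custom policy can be sketched as follows (needs `--enable-preview` on Java 21; a local `Start` record stands in for the StartingPoint class):

```java
import java.util.concurrent.StructuredTaskScope;

public class CustomScopeDemo {
    record Start(String url, int urlsOnPage) {}

    // A custom shutdown policy: the scope shuts down as soon as a subtask
    // delivers a starting point with more than 150 URLs on the page
    static class CriteriaScope extends StructuredTaskScope<Start> {
        private volatile Start result;

        @Override
        protected void handleComplete(Subtask<? extends Start> subtask) {
            if (subtask.state() == Subtask.State.SUCCESS && subtask.get().urlsOnPage() > 150) {
                result = subtask.get();
                shutdown(); // good enough: cancel the remaining subtasks
            }
        }

        public Start result() { return result; }
    }

    public static String findGoodStart() throws InterruptedException {
        try (var scope = new CriteriaScope()) {
            scope.fork(() -> new Start("http://a", 100)); // not enough URLs
            scope.fork(() -> new Start("http://b", 200)); // meets the criteria
            scope.join();
            return scope.result().url();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(findGoodStart()); // http://b
    }
}
```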
254 |
255 | ## Bonus feature 4: custom Structured task scope: moving the business logic
256 | Having all the domain logic inside the scope class is kind of ugly. It's just not a nice place to have business logic.
257 | You can improve this in any way you want; be creative :). The only suggestion I will give is to use a `Predicate`.
258 |
259 | > Note: While this doesn't hurt or improve performance, it is good to think critically about the code you write. Even if the API somewhat forces you into a solution, it doesn't mean
260 | > you can't create something that is a bit more maintainable and readable. :)
261 |
262 | ## Bonus feature 5: Deadlines with structured concurrency
263 | No one likes to wait forever. So let us add a deadline to the scope from the previous assignment. Create a deadline of any number
264 | of milliseconds, and look at what it does to your code and to the status of the virtual threads.
265 |
266 | > Note: The methods inside the `StartingPointFinder` all do a Thread.sleep() call.
267 | > `source1` waits 100 ms
268 | > `source2` waits 200 ms
269 | > `source3` waits 250 ms
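Deadlines go through `joinUntil` (needs `--enable-preview` on Java 21). A sketch, with a sleep standing in for a slow source:

```java
import java.time.Instant;
import java.util.concurrent.StructuredTaskScope;
import java.util.concurrent.TimeoutException;

public class DeadlineDemo {
    public static String findWithDeadline(long deadlineMillis) throws InterruptedException {
        try (var scope = new StructuredTaskScope.ShutdownOnSuccess<String>()) {
            // Stand-in for a slow starting-point source
            scope.fork(() -> { Thread.sleep(500); return "slow source"; });

            try {
                // Like join(), but gives up when the deadline passes
                scope.joinUntil(Instant.now().plusMillis(deadlineMillis));
                return scope.result(RuntimeException::new);
            } catch (TimeoutException e) {
                scope.shutdown(); // cancel the still-running subtasks
                scope.join();
                return "deadline exceeded";
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(findWithDeadline(100)); // deadline exceeded
    }
}
```

After the timeout, the cancelled subtasks' state is what the note above is asking you to look at.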
270 |
271 | ## Bonus feature 6: Virtual thread with existing executors
272 | Virtual threads are great, but how can you use them with existing code? During this assignment you will implement virtual threads with existing executors. In the bonus feature
273 | package you can find the `VirtualThreadsWithExecutors` class. Currently, it uses platform threads, but it is up to you to implement it with virtual threads.
274 |
275 | ## Bonus feature 7: Limit the number of requests without limiting the virtual thread creation
276 | The important thing about virtual threads is that you shouldn't use them as threads, but more like tasks you want to run concurrently.
277 | For the last bonus feature, you are tasked with limiting the number of requests going out to the back-end server, without limiting the number
278 | of virtual threads that get created. You can use anything you want, but I would recommend using a kind of lock :)
279 |
280 | Pooling virtual threads and pinning them in any way does not count.
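One possible shape for such a throttle, using a `Semaphore` as the "kind of lock" (a sketch with my own names; a sleep stands in for the HTTP call):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class LimitedRequestsDemo {
    static final Semaphore permits = new Semaphore(10); // at most 10 requests in flight
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger maxInFlight = new AtomicInteger();

    static void scrapeOnePage() throws InterruptedException {
        permits.acquire(); // unlimited virtual threads, but limited concurrent requests
        try {
            int now = inFlight.incrementAndGet();
            maxInFlight.accumulateAndGet(now, Math::max); // track the high-water mark
            Thread.sleep(5);                              // stand-in for the HTTP call
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    public static void main(String[] args) {
        // Create as many virtual threads as there are jobs;
        // only the semaphore throttles the requests
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1_000; i++) {
                executor.submit(() -> { scrapeOnePage(); return null; });
            }
        }
        System.out.println("max requests in flight: " + maxInFlight.get()); // never above 10
    }
}
```

Note that the threads themselves are never pooled or pinned; they all get created, and only the request section is gated.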
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <project xmlns="http://maven.apache.org/POM/4.0.0"
3 |          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
4 |          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
5 |     <modelVersion>4.0.0</modelVersion>
6 |
7 |     <groupId>com.davidvlijmincx</groupId>
8 |     <artifactId>workshop</artifactId>
9 |     <version>1.0-SNAPSHOT</version>
10 |
11 |     <properties>
12 |         <maven.compiler.source>21</maven.compiler.source>
13 |         <maven.compiler.target>21</maven.compiler.target>
14 |         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
15 |     </properties>
16 |
17 |     <dependencies>
18 |         <dependency>
19 |             <groupId>org.jsoup</groupId>
20 |             <artifactId>jsoup</artifactId>
21 |             <version>1.16.1</version>
22 |         </dependency>
23 |     </dependencies>
24 |
25 |     <build>
26 |         <plugins>
27 |             <plugin>
28 |                 <groupId>org.apache.maven.plugins</groupId>
29 |                 <artifactId>maven-compiler-plugin</artifactId>
30 |                 <version>3.8.0</version>
31 |                 <configuration>
32 |                     <release>21</release>
33 |                     <compilerArgs>--enable-preview</compilerArgs>
34 |                 </configuration>
35 |             </plugin>
36 |         </plugins>
37 |     </build>
38 |
39 | </project>
--------------------------------------------------------------------------------
/src/main/java/com/davidvlijmincx/WebScraper.java:
--------------------------------------------------------------------------------
1 | package com.davidvlijmincx;
2 |
3 | import org.jsoup.Jsoup;
4 | import org.jsoup.nodes.Document;
5 | import org.jsoup.nodes.Element;
6 | import org.jsoup.select.Elements;
7 |
8 | import java.io.IOException;
9 |
10 | import java.util.Set;
11 | import java.util.concurrent.ConcurrentHashMap;
12 | import java.util.concurrent.LinkedBlockingQueue;
13 |
14 | public class WebScraper {
15 |
16 | public static void main(String[] args) {
17 |         final var queue = new LinkedBlockingQueue<String>(2000);
18 |         Set<String> visited = ConcurrentHashMap.newKeySet(3000);
19 |
20 | queue.add("http://localhost:8080/v1/crawl/delay/330/57");
21 |
22 | long startTime = System.currentTimeMillis();
23 |
24 | new Scrape(queue, visited).scrape();
25 |
26 | measureTime(startTime, visited);
27 |
28 | }
29 |
30 |     private static void measureTime(long startTime, Set<String> visited) {
31 | long endTime = System.currentTimeMillis();
32 | long totalTime = endTime - startTime;
33 |
34 | double totalTimeInSeconds = totalTime / 1000.0;
35 |
36 |         System.out.printf("Crawled %s web page(s)%n", visited.size());
37 | System.out.println("Total execution time: " + totalTime + "ms");
38 |
39 | double throughput = visited.size() / totalTimeInSeconds;
40 | System.out.println("Throughput: " + throughput + " pages/sec");
41 | }
42 |
43 | }
44 |
45 | class Scrape {
46 |
47 |     private final LinkedBlockingQueue<String> pageQueue;
48 |
49 |     private final Set<String> visited;
50 |
51 |     public Scrape(LinkedBlockingQueue<String> pageQueue, Set<String> visited) {
52 | this.pageQueue = pageQueue;
53 | this.visited = visited;
54 | }
55 |
56 | public void scrape() {
57 |
58 | try {
59 | String url = pageQueue.take();
60 |
61 | Document document = Jsoup.connect(url).get();
62 | Elements linksOnPage = document.select("a[href]");
63 |
64 | visited.add(url);
65 | for (Element link : linksOnPage) {
66 | String nextUrl = link.attr("abs:href");
67 | if (nextUrl.contains("http")) {
68 | pageQueue.add(nextUrl);
69 | }
70 | }
71 |
72 | } catch (IOException | InterruptedException e) {
73 | throw new RuntimeException(e);
74 | }
75 |
76 | }
77 |
78 | }
--------------------------------------------------------------------------------
/src/main/java/com/davidvlijmincx/bonus/features/FindBestStartingSource.java:
--------------------------------------------------------------------------------
1 | package com.davidvlijmincx.bonus.features;
2 |
3 | import com.davidvlijmincx.bonus.features.setup.StartingPoint;
4 | import com.davidvlijmincx.bonus.features.setup.StartingPointFinder;
5 |
6 | import java.util.concurrent.StructuredTaskScope;
7 |
8 | public class FindBestStartingSource {
9 |
10 | public String FindTheBestStart(){
11 |
12 | try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
13 |
14 |             StructuredTaskScope.Subtask<StartingPoint> fork = scope.fork(StartingPointFinder::source1);
15 |             StructuredTaskScope.Subtask<StartingPoint> fork1 = scope.fork(StartingPointFinder::source2);
16 |             StructuredTaskScope.Subtask<StartingPoint> fork2 = scope.fork(StartingPointFinder::source3);
17 |
18 | scope.join();
19 |
20 | /// Show the state of the virtual thread for debug purposes
21 |             // System.out.println("fork.state() = " + fork.state());
22 |             // System.out.println("fork1.state() = " + fork1.state());
23 |             // System.out.println("fork2.state() = " + fork2.state());
24 |
25 | StartingPoint result = new StartingPoint("",0); // Get a result from the scope
26 |
27 | System.out.println("result: " + result.getUrlsOnPage() + " with URL: " + result.getUrl() );
28 | return result.getUrl();
29 | } catch (InterruptedException e) {
30 | throw new RuntimeException(e);
31 | }
32 |
33 | }
34 |
35 | }
36 |
37 |
38 | class CriteriaScope {
39 |
40 | private volatile StartingPoint startingPoint;
41 |
42 |
43 | public StartingPoint getResult(){return startingPoint;}
44 | }
--------------------------------------------------------------------------------
/src/main/java/com/davidvlijmincx/bonus/features/HighScore.java:
--------------------------------------------------------------------------------
1 | package com.davidvlijmincx.bonus.features;
2 |
3 | public class HighScore {
4 |
5 | public void submitScore(Double score) {
6 | ScoreValidator scoreValidator = new ScoreValidator();
7 | scoreValidator.validateAndSubmit();
8 | }
9 |
10 | }
11 |
12 | class ScoreValidator {
13 |
14 | public void validateAndSubmit() {
15 | ScoreSubmitter scoreSubmitter = new ScoreSubmitter();
16 | scoreSubmitter.submitScore();
17 | }
18 |
19 | }
20 |
21 | class ScoreSubmitter {
22 |
23 | public void submitScore() {
24 | System.out.println("The score is: " + GlobalScoreVariable.SCORE);
25 | }
26 | }
27 |
28 |
29 |
30 |
31 | class GlobalScoreVariable {
32 | final static Double SCORE = 0.0;
33 | }
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
--------------------------------------------------------------------------------
/src/main/java/com/davidvlijmincx/bonus/features/VirtualThreadsWithExecutors.java:
--------------------------------------------------------------------------------
1 | package com.davidvlijmincx.bonus.features;
2 |
3 | import java.util.concurrent.Executors;
4 | import java.util.concurrent.TimeUnit;
5 |
6 | public class VirtualThreadsWithExecutors {
7 |
8 |
9 | public static void main(String[] args) {
10 |
11 |
12 | var executor = Executors.newScheduledThreadPool(1);
13 | executor.schedule(() -> System.out.println("Hello, World!"), 1, TimeUnit.SECONDS);
14 | executor.shutdown();
15 |
16 |
17 | var executor1 = Executors.newSingleThreadExecutor();
18 | executor1.submit(() -> System.out.println("Hello, World!"));
19 | executor1.shutdown();
20 |
21 | }
22 |
23 | }
24 |
--------------------------------------------------------------------------------
/src/main/java/com/davidvlijmincx/bonus/features/setup/StartingPoint.java:
--------------------------------------------------------------------------------
1 | package com.davidvlijmincx.bonus.features.setup;
2 |
3 | public class StartingPoint {
4 |
5 | private String url;
6 | private int urlsOnPage;
7 |
8 | public StartingPoint(String url, int urlsOnPage) {
9 | this.url = url;
10 | this.urlsOnPage = urlsOnPage;
11 | }
12 |
13 | public String getUrl() {
14 | return url;
15 | }
16 |
17 | public void setUrl(String url) {
18 | this.url = url;
19 | }
20 |
21 | public int getUrlsOnPage() {
22 | return urlsOnPage;
23 | }
24 |
25 | public void setUrlsOnPage(int urlsOnPage) {
26 | this.urlsOnPage = urlsOnPage;
27 | }
28 | }
29 |
--------------------------------------------------------------------------------
/src/main/java/com/davidvlijmincx/bonus/features/setup/StartingPointFinder.java:
--------------------------------------------------------------------------------
1 | package com.davidvlijmincx.bonus.features.setup;
2 |
3 | public class StartingPointFinder {
4 |
5 |
6 | public static StartingPoint source1(){
7 | sleep(100);
8 | return new StartingPoint("http://localhost:8080/v1/crawl/330/100", 100);
9 | }
10 |
11 |
12 | public static StartingPoint source2(){
13 | sleep(200);
14 | return new StartingPoint("http://localhost:8080/v1/crawl/330/200", 200);
15 | }
16 |
17 | public static StartingPoint source3(){
18 | sleep(250);
19 | return new StartingPoint("http://localhost:8080/v1/crawl/330/50", 50);
20 | }
21 |
22 | private static void sleep(int millis) {
23 | try {
24 | Thread.sleep(millis);
25 | } catch (InterruptedException e) {
26 | throw new RuntimeException(e);
27 | }
28 | }
29 |
30 | }
31 |
--------------------------------------------------------------------------------
/virtual threads workshop.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/davidtos/virtual_thread_workshop/4d99cf33fe0da6396330b0f4db71baa122a17cdd/virtual threads workshop.pdf
--------------------------------------------------------------------------------
/virtual threads workshop.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/davidtos/virtual_thread_workshop/4d99cf33fe0da6396330b0f4db71baa122a17cdd/virtual threads workshop.pptx
--------------------------------------------------------------------------------