├── .gitignore
├── 0000-template.md
├── LICENSE
├── README.md
└── text
    ├── 0001-namespace-syscalls.md
    ├── 0002-event-overhaul.md
    ├── 0003-channels.md
    ├── 0004-ptrace.md
    ├── 0005-scheme-forward-fds.md
    ├── 0006-scheme-path.md
    ├── 0007-base-system-repo.md
    ├── 0008-userspace-signals.md
    └── 0009-namespace-scheme.md


/.gitignore:
--------------------------------------------------------------------------------
1 | *~
2 | 


--------------------------------------------------------------------------------
/0000-template.md:
--------------------------------------------------------------------------------
 1 | - Feature Name: (fill me in with a unique ident, my_awesome_feature)
 2 | - Start Date: (fill me in with today's date, YYYY-MM-DD)
 3 | - RFC PR: (leave this empty)
 4 | - Redox Issue: (leave this empty)
 5 | 
 6 | # Summary
 7 | [summary]: #summary
 8 | 
 9 | One paragraph explanation of the feature.
10 | 
11 | # Motivation
12 | [motivation]: #motivation
13 | 
14 | Why are we doing this? What use cases does it support? What is the expected outcome?
15 | 
16 | # Detailed design
17 | [design]: #detailed-design
18 | 
19 | This is the bulk of the RFC. Explain the design in enough detail for somebody familiar
20 | with the language to understand, and for somebody familiar with the compiler to implement.
21 | This should get into specifics and corner-cases, and include examples of how the feature is used.
22 | 
23 | # Drawbacks
24 | [drawbacks]: #drawbacks
25 | 
26 | Why should we *not* do this?
27 | 
28 | # Alternatives
29 | [alternatives]: #alternatives
30 | 
31 | What other designs have been considered? What is the impact of not doing this?
32 | 
33 | # Unresolved questions
34 | [unresolved]: #unresolved-questions
35 | 
36 | What parts of the design are still TBD?
37 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2016 Redox OS
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Redox RFCs
  2 | [Redox RFCs]: #redox-rfcs
  3 | 
  4 | Many changes, including bug fixes and documentation improvements, can be implemented and reviewed via the normal GitLab merge request workflow.
  5 | 
  6 | Some changes though are "substantial", and we ask that these be put through a bit of a design process and produce a consensus among the Redox community.
  7 | 
  8 | The "RFC" (request for comments) process is intended to provide a consistent and controlled path for new features to enter, so that all stakeholders can be confident about the direction the OS is evolving in.
  9 | 
 10 | ## When you need to follow this process
 11 | [When you need to follow this process]: #when-you-need-to-follow-this-process
 12 | 
 13 | You need to follow this process if you intend to make "substantial" changes to
 14 | Redox, Cargo, Crates.io, or the RFC process itself. What constitutes a
 15 | "substantial" change is evolving based on community norms and varies depending
 16 | on what part of the ecosystem you are proposing to change.
 17 | 
 18 | Some changes do not require an RFC:
 19 | 
 20 |    - Rephrasing, reorganizing, refactoring, or otherwise "changing shape
 21 | does not change meaning".
 22 |    - Additions that strictly improve objective, numerical quality
 23 | criteria (warning removal, speedup, better software compatibility, more
 24 | parallelism, trap more errors, etc.)
 25 |    - Additions only likely to be noticed by other developers-of-redox, invisible to users-of-redox.
 26 | 
 27 | If you submit a merge request to implement a new feature without going
 28 | through the RFC process, it may be closed with a polite request to
 29 | submit an RFC first.
 30 | 
 31 | ## Before creating an RFC
 32 | [Before creating an RFC]: #before-creating-an-rfc
 33 | 
 34 | A hastily-proposed RFC can hurt its chances of acceptance. Low quality
 35 | proposals, proposals for previously-rejected features, or those that
 36 | don't fit into the near-term roadmap, may be quickly rejected, which
 37 | can be demotivating for the unprepared contributor. Laying some
 38 | groundwork ahead of the RFC can make the process smoother.
 39 | 
 40 | Although there is no single way to prepare for submitting an RFC, it
 41 | is generally a good idea to pursue feedback from other project
 42 | developers beforehand, to ascertain that the RFC may be desirable:
 43 | having a consistent impact on the project requires concerted effort
 44 | toward consensus-building.
 45 | 
 46 | As a rule of thumb, receiving encouraging feedback from long-standing project developers is a good indication that the RFC is worth pursuing.
 47 | 
 48 | ## What the process is
 49 | [What the process is]: #what-the-process-is
 50 | 
 51 | In short, to get a major feature added to Redox, one must first get the
 52 | RFC merged into the RFC repo as a markdown file. At that point the RFC
 53 | is 'active' and may be implemented with the goal of eventual inclusion
 54 | into Redox.
 55 | 
 56 | * Fork the RFC repo http://gitlab.redox-os.org/redox-os/rfcs
 57 | * Copy `0000-template.md` to `text/0000-my-feature.md` (where 'my-feature' is
 58 | descriptive; don't assign an RFC number yet).
 59 | * Fill in the RFC. Put care into the details: RFCs that do not present
 60 | convincing motivation, demonstrate understanding of the impact of the design, or
 61 | are disingenuous about the drawbacks or alternatives tend to be poorly-received.
 62 | * Submit a merge request. As a merge request the RFC will receive design feedback
 63 | from the larger community, and the author should be prepared to revise it in
 64 | response.
 65 | * RFCs rarely go through this process unchanged, especially as alternatives and
 66 | drawbacks are shown. You can make edits, big and small, to the RFC to
 67 | clarify or change the design, but make changes as new commits to the PR, and
 68 | leave a comment on the PR explaining your changes. Specifically, do not squash
 69 | or rebase commits after they are visible on the PR.
 70 | <!--
 71 | * Once both proponents and opponents have clarified and defended positions and
 72 | the conversation has settled, the RFC will enter its *final comment period*
 73 | (FCP). This is a final opportunity for the community to comment on the PR and is
 74 | a reminder for all members of the sub-team to be aware of the RFC.
 75 | * The FCP lasts one week. It may be extended if consensus between sub-team
 76 | members cannot be reached. At the end of the FCP,  the [sub-team] will either
 77 | accept the RFC by merging the merge request, assigning the RFC a number
 78 | (corresponding to the merge request number), at which point the RFC is 'active',
 79 | or reject it by closing the merge request. How exactly the sub-team decide on an
 80 | RFC is up to the sub-team.
 81 | -->
 82 | 
 83 | ## The RFC life-cycle
 84 | [The RFC life-cycle]: #the-rfc-life-cycle
 85 | 
 86 | Once an RFC becomes active then authors may implement it and submit
 87 | the feature as a merge request to the Redox repo. Being 'active' is not
 88 | a rubber stamp, and in particular still does not mean the feature will
 89 | ultimately be merged; it does mean that in principle all the major
 90 | stakeholders have agreed to the feature and are amenable to merging
 91 | it.
 92 | 
 93 | Furthermore, the fact that a given RFC has been accepted and is
 94 | 'active' implies nothing about what priority is assigned to its
 95 | implementation, nor does it imply anything about whether a Redox
 96 | developer has been assigned the task of implementing the feature.
 97 | While it is not *necessary* that the author of the RFC also write the
 98 | implementation, it is by far the most effective way to see an RFC
 99 | through to completion: authors should not expect that other project
100 | developers will take on responsibility for implementing their accepted
101 | feature.
102 | 
103 | Modifications to active RFCs can be done in follow-up PRs. We strive
104 | to write each RFC in such a manner that it will reflect the final design of
105 | the feature; but the nature of the process means that we cannot expect
106 | every merged RFC to actually reflect what the end result will be at
107 | the time of the next major release.
108 | 
109 | In general, once accepted, RFCs should not be substantially changed. Only very
110 | minor changes should be submitted as amendments. More substantial changes should
111 | be new RFCs, with a note added to the original RFC.
112 | 
113 | ## Implementing an RFC
114 | [Implementing an RFC]: #implementing-an-rfc
115 | 
116 | Some accepted RFCs represent vital features that need to be implemented right away.
117 | Other accepted RFCs can represent features that can wait until some arbitrary developer feels like doing the
118 | work.
119 | Every accepted RFC has an associated issue tracking its implementation in the Redox repository; thus that associated issue can be assigned a priority that the team uses for all issues in the Redox repository.
120 | 
121 | The author of an RFC is not obligated to implement it.
122 | Of course, the RFC author (like any other developer) is welcome to post an implementation for review after the RFC has been accepted.
123 | 
124 | If you are interested in working on the implementation for an 'active' RFC, but cannot determine if someone else is already working on it, feel free to ask (e.g. by leaving a comment on the associated issue).
125 | 
126 | 
127 | ## RFC Postponement
128 | [RFC Postponement]: #rfc-postponement
129 | 
130 | Some RFC merge requests are tagged with the 'postponed' label when they are closed (as part of the rejection process).
131 | An RFC closed with “postponed” is marked as such because we want neither to think about evaluating the proposal nor about implementing the described feature until some time in the future, and we believe that we can afford to wait until then to do so.
132 | Postponed PRs may be re-opened when the time is right.
133 | 
134 | Usually an RFC merge request marked as “postponed” has already passed
135 | an informal first round of evaluation, namely the round of “do we
136 | think we would ever possibly consider making this change, as outlined
137 | in the RFC merge request, or some semi-obvious variation of it.”  (When
138 | the answer to the latter question is “no”, then the appropriate
139 | response is to close the RFC, not postpone it.)
140 | 
141 | 
142 | ### Help this is all too informal!
143 | [Help this is all too informal!]: #help-this-is-all-too-informal
144 | 
145 | The process is intended to be as lightweight as reasonable for the
146 | present circumstances. As usual, we are trying to let the process be
147 | driven by consensus and community norms, not impose more structure than
148 | necessary.
149 | 
150 | #### This text
151 | 
152 | This text is originally based on an older version of the README from https://github.com/rust-lang/rfcs .
153 | 


--------------------------------------------------------------------------------
/text/0001-namespace-syscalls.md:
--------------------------------------------------------------------------------
 1 | - Feature Name: namespace-syscalls
 2 | - Start Date: 2016-11-23
 3 | - RFC PR: https://github.com/redox-os/rfcs/pull/4
 4 | - Redox Issue: N/A
 5 | 
 6 | # Summary
 7 | [summary]: #summary
 8 | 
 9 | A namespace is designed to implement the following with one abstraction:
10 | - `cap_enter`, by default
11 | - `chroot`, by allowing a filter on `file:`
12 | - [OS-level virtualization](https://en.wikipedia.org/wiki/Operating-system-level_virtualization) such as [FreeBSD-style Jails](https://en.wikipedia.org/wiki/FreeBSD_jail) or [Illumos-style Zones](https://en.wikipedia.org/wiki/Solaris_Containers), with more complex filtering of scheme access
13 | 
14 | It achieves this with the addition of three syscalls:
15 | - `getns`, which gets the current namespace
16 | - `mkns`, which creates a new namespace
17 | - `setns`, which switches namespaces
18 | 
19 | # Motivation
20 | [motivation]: #motivation
21 | 
22 | Why are we doing this? What use cases does it support? What is the expected outcome?
23 | 
24 | # Detailed design
25 | [design]: #detailed-design
26 | 
27 | ```rust
28 | // Get the current namespace
29 | let old_ns = getns();
30 | // Create a new empty namespace
31 | let new_ns = mkns(&[]);
32 | // Switch to the new namespace
33 | // This is only possible because this process created new_ns
34 | setns(new_ns);
35 | 
36 | // Create a child fork
37 | let child = clone(0);
38 | if child == 0 {
39 |     // Execute a process in the new namespace
40 |     // This will reset the original namespace, preventing setns(old_ns)
41 |     exec("process-to-contain");
42 | } else {
43 |     // Create a new `file:` in the new namespace
44 |     let file_scheme = open(":file", O_CREAT | O_RDWR);
45 | 
46 |     // Switch back to the original namespace
47 |     // This is only possible because this process was once inside of old_ns
48 |     setns(old_ns);
49 | 
50 |     // For every file event in the new `file:`
51 |     for event in file_scheme.events() {
52 |         // Translate it if required and forward it to the original `file:`
53 |         handle_event(event);
54 |     }
55 | }
56 | ```
57 | 
58 | # Drawbacks
59 | [drawbacks]: #drawbacks
60 | 
61 | Why should we *not* do this?
62 | 
63 | - Potential rooting by placing a setuid program in a specially designed container
64 | 
65 | # Alternatives
66 | [alternatives]: #alternatives
67 | 
68 | What other designs have been considered? What is the impact of not doing this?
69 | 
70 | # Unresolved questions
71 | [unresolved]: #unresolved-questions
72 | 
73 | What parts of the design are still TBD?
74 | 
75 | - How to prevent rooting by placing a setuid program in a specially designed container
76 | 


--------------------------------------------------------------------------------
/text/0002-event-overhaul.md:
--------------------------------------------------------------------------------
 1 | - Feature Name: event-overhaul
 2 | - Start Date: 2018-05-19
 3 | - RFC PR: https://github.com/redox-os/rfcs/pull/10
 4 | - Redox Issue: https://github.com/redox-os/kernel/issues/89
 5 | 
 6 | # Summary
 7 | [summary]: #summary
 8 | 
 9 | Overhaul the kernel event system to support mio.
10 | 
11 | # Motivation
12 | [motivation]: #motivation
13 | 
14 | The current kernel event system has one event queue per context, which is not
15 | flexible enough to be used by mio.
16 | 
17 | # Detailed design
18 | [design]: #detailed-design
19 | 
20 | - Make `fevent` system call return `ENOSYS`. The system call number will still
21 |   be needed to support kernel <> scheme event communication.
22 | - Remove old kernel event system.
23 | - Produce a new kernel event system which can support the example below
24 | - Port existing event users to the new system
25 | - Ensure that event generators will always trigger events when added to an
26 |   event queue, and will be edge triggered after that
27 | - Rebuild all packages
28 | - Produce a new major release of Redox OS
29 | 
30 | ```rust
31 | // This is a pseudo-Rust example
32 | 
33 | // An event object, which can be converted to [u8] to be written to a file
34 | #[derive(Copy, Clone, Debug, Default, PartialEq)]
35 | #[repr(C)]
36 | pub struct Event {
37 |     pub id: usize,
38 |     pub flags: usize,
39 |     pub data: usize
40 | }
41 | 
42 | // An example file, a network interface
43 | let file = OpenOptions::new()
44 |     .read(true)
45 |     .write(true)
46 |     .custom_flags(O_CLOEXEC | O_NONBLOCK)
47 |     .open("network:").unwrap();
48 | 
49 | // Create a new event queue. This is tracked by file id
50 | let mut event_queue = OpenOptions::new()
51 |     .read(true)
52 |     .write(true)
53 |     .custom_flags(O_CLOEXEC)
54 |     .open("event:").unwrap();
55 | 
56 | // Create a request for read events on the file, with a unique token
57 | let event_request = Event {
58 |     id: file.as_raw_fd(),
59 |     flags: EVENT_READ,
60 |     data: 0x1234
61 | };
62 | 
63 | // Add the event request to this event queue
64 | event_queue.write(&event_request).unwrap();
65 | 
66 | loop {
67 |     // Wait for the next event
68 |     let mut event = Event::default();
69 |     let count = event_queue.read(&mut event).unwrap();
70 |     if count == mem::size_of::<Event>() {
71 |         // The event should have the id set to the network file, the flags set
72 |         // to EVENT_READ, and the data set the same as the request
73 |         assert_eq!(event, event_request);
74 |     } else {
75 |         panic!("invalid size of event: {}", count);
76 |     }
77 | }
78 | ```
79 | 
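
The comment in the example above says an `Event` "can be converted to [u8]" before it is written to the queue. The following is a minimal sketch of one way that conversion could look, assuming the three fields are laid out in declaration order as native-endian `usize` words. The `to_bytes` and `from_bytes` helpers and the `EVENT_SIZE` constant are hypothetical names used only for illustration; they are not part of the RFC.

```rust
use std::mem::size_of;

/// Same layout as the `Event` struct in the example above.
#[derive(Copy, Clone, Debug, Default, PartialEq)]
#[repr(C)]
pub struct Event {
    pub id: usize,
    pub flags: usize,
    pub data: usize,
}

/// Size in bytes of one serialized event (hypothetical helper constant).
pub const EVENT_SIZE: usize = size_of::<Event>();

impl Event {
    /// Serialize the fields in declaration order, native-endian.
    /// (Hypothetical helper; the RFC does not name this function.)
    pub fn to_bytes(&self) -> [u8; EVENT_SIZE] {
        let word = size_of::<usize>();
        let mut buf = [0u8; EVENT_SIZE];
        buf[..word].copy_from_slice(&self.id.to_ne_bytes());
        buf[word..2 * word].copy_from_slice(&self.flags.to_ne_bytes());
        buf[2 * word..].copy_from_slice(&self.data.to_ne_bytes());
        buf
    }

    /// Parse one event; returns `None` unless `buf` is exactly one event long,
    /// mirroring the size check in the read loop above.
    pub fn from_bytes(buf: &[u8]) -> Option<Event> {
        if buf.len() != EVENT_SIZE {
            return None;
        }
        let word = size_of::<usize>();
        // Copy the i-th word out of the buffer and decode it.
        let field = |i: usize| -> usize {
            let mut raw = [0u8; size_of::<usize>()];
            raw.copy_from_slice(&buf[i * word..(i + 1) * word]);
            usize::from_ne_bytes(raw)
        };
        Some(Event { id: field(0), flags: field(1), data: field(2) })
    }
}
```

With helpers like these, the request would be written as `event_queue.write(&event_request.to_bytes())` and each read would fill an `EVENT_SIZE`-byte buffer that is then passed to `Event::from_bytes`.
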
80 | # Drawbacks
81 | [drawbacks]: #drawbacks
82 | 
83 | A number of programs will need to be updated to the new event system. The old
84 | system will by necessity need to be removed.
85 | 
86 | # Alternatives
87 | [alternatives]: #alternatives
88 | 
89 | The alternative is to attempt to implement different event handling in a
90 | userspace daemon.
91 | 
92 | # Unresolved questions
93 | [unresolved]: #unresolved-questions
94 | 
95 | What parts of the design are still TBD?
96 | 


--------------------------------------------------------------------------------
/text/0003-channels.md:
--------------------------------------------------------------------------------
 1 | - Feature Name: channels
 2 | - Start Date: 2017-12-29
 3 | - RFC PR: (leave this empty)
 4 | - Redox Issue: (leave this empty)
 5 | 
 6 | # Summary
 7 | [summary]: #summary
 8 | 
 9 | Design for a fast, bidirectional IPC mechanism in Redox.
10 | 
11 | # Motivation
12 | [motivation]: #motivation
13 | 
14 | An easy-to-use, performant mechanism for IPC would be extremely useful. At the
15 | moment, the primary way of communicating between processes that are not
16 | parent-child is through creating schemes.
17 | 
18 | # Detailed design
19 | [design]: #detailed-design
20 | 
21 | The current design consists of a single scheme, `chan:`, that provides an
22 | interface for creating, interfacing with, and closing channels. Each channel
23 | has one server process and one or more client processes.
24 | 
25 | In this documentation, `<name>` is used to represent a channel name. This can
26 | be anything, but must be known by all the participating processes.
27 | 
28 | Usage:
29 | 1. The server process opens `chan:<name>` with the `O_CREAT` flag.
30 | 2. It listens for connections and duplicates the file descriptor with the path
31 |    `"listen"` to accept (similarly to the way that TCP is implemented in Redox).
32 | 3. The connection can be written to by both the client and server.
33 | 
34 | For an unnamed socket:
35 | 1. The process opens `chan:` with the `O_CREAT` flag to create a server.
36 | 2. It duplicates the file descriptor with the path `"connect"` to create a
37 |    client and connect to it.
38 | 3. The server duplicates again, now with the path `"listen"` to accept the
39 |    newly created client.
40 | 4. The connection can be written to by both ends.
41 | 
42 | # Drawbacks
43 | [drawbacks]: #drawbacks
44 | 
45 | There aren't any real drawbacks to this. It's a solid, easily extensible IPC
46 | mechanism.
47 | 
48 | # Alternatives
49 | [alternatives]: #alternatives
50 | 
51 | - Named Pipes
52 | - Actually implement UNIX Domain Sockets
53 | - Do nothing
54 | 


--------------------------------------------------------------------------------
/text/0004-ptrace.md:
--------------------------------------------------------------------------------
  1 | - Feature Name: ptrace
  2 | - Start Date: 2019-06-08
  3 | - RFC PR: (leave this empty)
  4 | - Redox Issue: (leave this empty)
  5 | 
  6 | # Summary
  7 | [summary]: #summary
  8 | 
  9 | A ptrace-like feature for tracing processes in Redox OS.
 10 | 
 11 | # Motivation
 12 | [motivation]: #motivation
 13 | 
 14 | Currently, we have no way for debuggers to work in Redox. We provide
 15 | no interface for tracing a process' system calls or instructions, and
 16 | no interface for managing another process' memory.
 17 | 
 18 | A good first step for implementing `gdb` or a similar utility would be
 19 | to implement a Linux `ptrace(...)` alternative for Redox. This should
 20 | not only open up the possibility for debuggers, but also system-call
 21 | translation processes like WINE, perhaps for Linux compatibility, which
 22 | would rid us of the problem of porting software.
 23 | 
 24 | And even with *pure* `ptrace`, without any register or memory reading,
 25 | one can still implement the immensely useful tool `strace`, which
 26 | could serve as an alternative to recompiling the kernel with system
 27 | call debugging turned on or off. This is probably what we should focus
 28 | on getting to work initially, before getting to the good stuff.
 29 | 
 30 | # Detailed design
 31 | [design]: #detailed-design
 32 | 
 33 | The Linux ptrace interface is sometimes considered a huge mistake due
 34 | to its inconsistency and it being just one massive function. The Redox
 35 | interface will have to keep that in mind, as well as remove any
 36 | duplicate or otherwise redundant functions.
 37 | 
 38 | All process-controlling functions are implemented as one kernel
 39 | scheme, `proc:`. Opening it up with the `pid` of a tracee and a path
 40 | will perform a specific operation on the process. The benefit of using
 41 | schemes as opposed to a Linux-style function is that we get the
 42 | ability to disallow this feature using namespacing for free. It also
 43 | allows multiplexing using `event:`, and can therefore be used in a
 44 | nonblocking fashion just like any other file descriptor.
 45 | 
 46 | ## Process trace
 47 | 
 48 | Opening `proc:<pid>/trace` will give you a file to which you can write
 49 | proc-related functions, and closing the file descriptor will detach
 50 | from the tracee automatically. If any breakpoints are set when the file is
 51 | closed, they are deleted and the process is resumed. Only *one* tracer
 52 | can control a process, as I am too close-minded to come up with a
 53 | design that would make sense for running multiple tracers on a single
 54 | tracee.
 55 | 
 56 | That said, if the tracer opens the file with the `O_EXCL` flag, closing
 57 | it will instead send `SIGKILL` to the tracee. This is to
 58 | prevent any ptrace-contained processes from breaking out. (`O_EXCL`
 59 | can be thought of as meaning the tracer is the only one who controls
 60 | the process, and the process can't live on its own.)
 61 | 
 62 | Another flag used in `open` is `O_TRUNC` which will stop the process
 63 | *immediately*. This can be compared to using `PTRACE_ATTACH` on Linux
 64 | as opposed to `PTRACE_SEIZE`. (`O_TRUNC` can be thought of as
 65 | *truncating*/clearing the file's execution. It's a stretch, but I have
 66 | no better idea)
 67 | 
 68 | The most important operation of `ptrace` is of course to set
 69 | breakpoints! Redox tries to unify the Linux event system and the
 70 | breakpoint system by making the input a set of bitflags with *one or more*
 71 | breakpoints. It will return when the first breakpoint/event specified
 72 | within the bitmask is reached, in which case it'll add that event for
 73 | reading using the `read` system call (see events below). If an event
 74 | is already set, the `write` returns immediately. (A slight exception is
 75 | `O_NONBLOCK`, see that below as well.)
 76 | 
 77 | Each breakpoint is set and optionally awaited using the `write` system
 78 | call. Each such call will also resume the tracee in case it's stopped
 79 | after another breakpoint. So if you write a value with no stop bits
 80 | set, the program will run to completion. The exception is manually
 81 | specifying `PTRACE_FLAG_WAIT` (even in blocking mode, see below),
 82 | which will - unless any new stop is set - only wait for an existing
 83 | breakpoint to be reached.
 84 | 
 85 | ### Breakpoints
 86 | 
 87 | - If `PTRACE_STOP_PRE_SYSCALL` is set, the tracee will break on the
 88 |   next start of a syscall. This diverges from Linux' way of using
 89 |   `PTRACE_SYSCALL` for *both* pre- and post- syscall. However it's for
 90 |   a good reason: Signals can occur in the middle of a syscall, and
 91 |   unlike Linux which just delays the signal, we should go the simplest
 92 |   route to minimize kernel code size and let the user choose the
 93 |   behavior they want and not choose for them.
 94 | - If `PTRACE_STOP_POST_SYSCALL` is set, the tracee will break at the
 95 |   end of a syscall, when the return value has just been set in the
 96 |   appropriate register.
 97 | - If `PTRACE_STOP_SINGLESTEP` is set, the tracee is stopped after the
 98 |   execution of just one assembly instruction. If used together with
 99 |   any system-call trace, the system-call method will take precedence
100 |   and allow you to fine-tune how that should work. (Not a special
101 |   case, the syscall trace returns before the instruction returns and
102 |   thus is what is used by the multiplexing trace call!)
103 | - If `PTRACE_STOP_SIGNAL` is set, the tracee is stopped before the next
104 |   signal is handled. To the break event (see section on events below),
105 |   the signal number is pushed as the first parameter, and the pointer
106 |   to the signal handler as the second parameter. The pointer can help
107 |   you detect whether a signal will be handled by kernel space or
108 |   userspace, by detecting constants such as `SIG_DFL` and `SIG_IGN`.
109 | - If `PTRACE_STOP_BREAKPOINT` is set, the tracee is stopped when the
110 |   Breakpoint Exception, interrupt number 3, is triggered. This is
111 |   commonly caused by the `int3` instruction with opcode `0xCC` on
112 |   x86_64. The default behavior for breakpoint exceptions is to exit
113 |   the process with the `SIGTRAP` signal, but you can unfortunately not
114 |   catch this with `PTRACE_STOP_SIGNAL` due to the fact that signals
115 |   are sent in a way that never goes through the signal
116 |   handler. Instead of the microkernel working around this just to
117 |   cause an ambiguous breakpoint event, the two causes of signals are
118 |   separated. Like the signal breakpoint, the default behavior (to exit
119 |   the process) can be ignored using `PTRACE_FLAG_IGNORE` (read on
120 |   flags below).
121 | - If `PTRACE_STOP_EXIT` is set, the tracee is stopped before the
122 |   process exits. Exits here cover all kinds of contexts, like
123 |   *after* signals (so they will first raise `PTRACE_STOP_SIGNAL` if
124 |   selected), *during* an exit syscall (will never reach
125 |   `PTRACE_STOP_POST_SYSCALL`), as well as when caused by a hardware
126 |   interrupt such as when an out-of-bounds read occurs. You cannot
127 |   abort the exit and continue running the program, but you can inspect
128 |   everything just like any other breakpoint.
129 | 
130 | ### Non-breakpoint events
131 | 
132 | These events will not stop the tracee; it will rather keep running in the
133 | background until whatever breakpoint was set alongside them is reached.
134 | 
135 | - If `PTRACE_EVENT_CLONE` is set, the tracer will wake up when the
136 |   tracee creates a new child process. An event will be delivered to the
137 |   tracer with the PID as the first parameter. The child process will
138 |   be in a stopped state, but unless attached to with a separate
139 |   tracer, it will be restarted upon the next ptrace invocation.
140 | 
141 | ### Flags
142 | 
143 | - If `PTRACE_FLAG_IGNORE` is set, the general action being done is
144 |   aborted and returned early. If this is set immediately after a
145 |   pre-syscall breakpoint, the system call is not executed; instead,
146 |   by setting the registers, *the tracer* can handle the system
147 |   call. This behavior is known as "sysemu" on Linux. This
148 |   general-purpose flag also lets you ignore tracee signals (except for
149 |   `SIGKILL` which is off-limits and will ignore your wishes).
150 | - If `PTRACE_FLAG_WAIT` is set, the `write` call will not return
151 |   before the breakpoint is reached, but rather await that. This is the
152 |   default behavior whenever `O_NONBLOCK` is not set, but this flag
153 |   lets nonblocking tracers override that behavior. As explained
154 |   briefly above, this flag will not restart a stopped tracee unless a
155 |   new stop bit was set - which is behavior *not* replicated by default
156 |   without `O_NONBLOCK`.
157 | 
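
The rules above for how a written bitmask is interpreted can be summarized in a small sketch. The RFC names the bits but does not assign them numeric values, so every bit position below is a placeholder, and the helper functions `sets_breakpoint`, `resumes_tracee`, and `write_waits` are hypothetical names introduced only to restate the prose as code.

```rust
// NOTE: all bit positions are placeholders chosen for illustration;
// the RFC does not specify the numeric values of these flags.
pub const PTRACE_STOP_PRE_SYSCALL: u64 = 1 << 0;
pub const PTRACE_STOP_POST_SYSCALL: u64 = 1 << 1;
pub const PTRACE_STOP_SINGLESTEP: u64 = 1 << 2;
pub const PTRACE_STOP_SIGNAL: u64 = 1 << 3;
pub const PTRACE_STOP_BREAKPOINT: u64 = 1 << 4;
pub const PTRACE_STOP_EXIT: u64 = 1 << 5;
pub const PTRACE_EVENT_CLONE: u64 = 1 << 8;
pub const PTRACE_FLAG_IGNORE: u64 = 1 << 16;
pub const PTRACE_FLAG_WAIT: u64 = 1 << 17;

/// All bits that stop the tracee when reached.
pub const PTRACE_STOP_MASK: u64 = PTRACE_STOP_PRE_SYSCALL
    | PTRACE_STOP_POST_SYSCALL
    | PTRACE_STOP_SINGLESTEP
    | PTRACE_STOP_SIGNAL
    | PTRACE_STOP_BREAKPOINT
    | PTRACE_STOP_EXIT;

/// True if the written value requests at least one breakpoint.
pub fn sets_breakpoint(mask: u64) -> bool {
    mask & PTRACE_STOP_MASK != 0
}

/// A write resumes a stopped tracee, except when `PTRACE_FLAG_WAIT`
/// is given without any new stop bit, which only waits.
pub fn resumes_tracee(mask: u64) -> bool {
    mask & PTRACE_FLAG_WAIT == 0 || sets_breakpoint(mask)
}

/// A write waits for the breakpoint by default; with `O_NONBLOCK` it
/// only waits when `PTRACE_FLAG_WAIT` is set explicitly.
pub fn write_waits(mask: u64, nonblocking: bool) -> bool {
    !nonblocking || mask & PTRACE_FLAG_WAIT != 0
}
```

For example, an `strace`-like tracer would write `PTRACE_STOP_PRE_SYSCALL | PTRACE_STOP_POST_SYSCALL` in a loop, while a nonblocking tracer that has drained its events would write `PTRACE_FLAG_WAIT` alone to keep waiting without restarting the tracee.
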
158 | ---
159 | 
160 | Because `ptrace` does **not** rely on signals, when a process is
161 | ptrace-stopped (such as attaching to the tracee with `O_TRUNC`
162 | explained above) you can send `SIGCONT` without actually restarting
163 | the process. The process is restarted only using a ptrace operation or
164 | when the tracer file handle is closed. This signal is instead just
165 | scheduled to get handled whenever the tracee starts, which allows the
166 | tracee to raise `SIGSTOP` and let the tracer restart it only after
167 | a ptrace operation was completed.
168 | 
169 | When the tracee exits (after any selected `PTRACE_STOP_EXIT`
170 | breakpoints are invoked), any blocking operation depending on it stops
171 | and instead returns `ESRCH`. It does not, however, reap the zombie
172 | process. Therefore, if the tracee is your own child process you should
173 | invoke `waitpid` immediately after an `ESRCH` error, which will also
174 | allow you to obtain the exit status in the normal fashion, without
175 | putting a breakpoint specifically on exit.
176 | 
177 | ### Events
178 | 
179 | Events give the tracer information about breakpoints or actions the
180 | tracee has taken. There are two types of events: Breakpoint events,
181 | and non-breakpoint events. Only breakpoint events stop the tracee when
182 | reached, other events only wake up the tracer, while the tracee keeps
183 | going. The way you receive events is by `read`ing a `PtraceEvent`
184 | structure from the file. Reads are not blocking, and will return `0`
185 | when no event was able to be read.
186 | 
187 | Events are read sequentially, i.e. follow first-in-first-out. The
188 | standard behavior for handling non-breakpoint events is to read them
189 | all and then retry waiting for the breakpoint to be reached using
190 | `PTRACE_FLAG_WAIT`. Any unread events from the last operation will
191 | cause a new one to return immediately, in order to prevent a possible
192 | race condition where you think you've read all events but another one
193 | occurs right when you want to retry the wait for a breakpoint to be reached.
194 | 
195 | The structure has a value `cause` specifying what bit caused the
196 | tracer to wake up, as well as a set of values like `a` (first
197 | parameter), `b` (second parameter), `c` (third parameter), etc. For
198 | example, if the input was `PTRACE_STOP_SIGNAL | PTRACE_EVENT_CLONE`,
199 | `cause` may be either `PTRACE_STOP_SIGNAL` or `PTRACE_EVENT_CLONE`
200 | depending on which event was hit first. The `a` value of
201 | `PTRACE_STOP_SIGNAL` is the signal number which caused the breakpoint
202 | to be hit, while the `a` value of `PTRACE_EVENT_CLONE` is the PID of
203 | the tracee's new child process.
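How a tracer might interpret `cause` and `a` can be sketched as follows. This is only an illustration: the bit values, the exact field set of `PtraceEvent`, and the `Decoded` enum and helper names are assumptions made for this sketch, not definitions from the `syscall` crate.

```rust
// Hypothetical bit values for this sketch only; the real flags come from
// the `syscall` crate and may differ.
const PTRACE_STOP_PRE_SYSCALL: u64 = 1 << 0;
const PTRACE_STOP_SIGNAL: u64 = 1 << 3;
const PTRACE_EVENT_CLONE: u64 = 1 << 8;
// Assumption: all non-breakpoint events live in the bits covered by this mask.
const PTRACE_EVENT_MASK: u64 = 0xFF00;

// Simplified stand-in for the structure read from the trace file.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct PtraceEvent {
    cause: u64,
    a: usize,
    b: usize,
    c: usize,
}

// Hypothetical decoded form, introduced only for illustration.
#[derive(Debug, PartialEq)]
enum Decoded {
    SignalStop { signal: usize },
    CloneEvent { child_pid: usize },
    OtherStop(u64),
    OtherEvent(u64),
}

// Only breakpoint events stop the tracee; other events merely wake the tracer.
fn stops_tracee(event: &PtraceEvent) -> bool {
    event.cause & PTRACE_EVENT_MASK == 0
}

// Interpret `a` according to which bit is reported in `cause`.
fn decode(event: &PtraceEvent) -> Decoded {
    if event.cause == PTRACE_STOP_SIGNAL {
        Decoded::SignalStop { signal: event.a }
    } else if event.cause == PTRACE_EVENT_CLONE {
        Decoded::CloneEvent { child_pid: event.a }
    } else if event.cause & PTRACE_EVENT_MASK != 0 {
        Decoded::OtherEvent(event.cause)
    } else {
        Decoded::OtherStop(event.cause)
    }
}
```

A tracer loop would call `decode` on each event it reads and go back to waiting only when `stops_tracee` returns `false`.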
204 | 
205 | ### Nonblocking mode
206 | 
207 | In nonblocking mode, a ptrace call without the `PTRACE_FLAG_WAIT` bit
208 | set will return `1` immediately. Any breakpoint specified is set and
209 | will, as usual, overwrite any existing breakpoints. Note that the
210 | file will send events to the `event:` scheme, meaning you can
211 | multiplex multiple tracers.
212 | 
213 | `EVENT_READ` is triggered whenever the first event arrives. Since an
214 | event only gets pushed to the stack if it's within the specified write
215 | bitmask, all events in the stack are of interest and this notification
216 | means you should immediately read them all.
217 | 
218 | `EVENT_WRITE` is reserved, for now.
219 | 
220 | ## Modify registers
221 | 
222 | Another important part of Linux `ptrace` is reading and writing
223 | registers. Opening the file `proc:<pid>/regs/int` will allow you to
224 | `read`/`write` a struct consisting of the integer values of all
225 | registers. The same for `proc:<pid>/regs/float`, but for all
226 | floating-point values. This is similar to Linux's `GETREGS`/`SETREGS`.
227 | 
228 | ## Modify memory
229 | 
230 | There are many different ways to read a process' memory in Linux, but
231 | Redox should put effort in unifying these functions. Namely, the
232 | `proc:` scheme is an excellent candidate for a unified
233 | memory-modifying system. Opening `proc:<pid>/mem` will allow you to
234 | seek/read/write around the memory of another process. This is a
235 | unification of the following calls in Linux:
236 | 
237 | - `/proc/<pid>/mem`
238 | - `PTRACE_POKEDATA`/`PTRACE_PEEKDATA`
239 | - `process_vm_readv`/`process_vm_writev`
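Since the interface is just seek/read/write on a file, peek and poke helpers can be written against the generic `std::io` traits. The sketch below is a rough illustration: `peek` and `poke` are hypothetical helper names, and on Redox `mem` would presumably be a file opened at `proc:<pid>/mem` with read and write access.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

// Read `len` bytes starting at address `addr` of the traced process.
// `mem` is anything seekable and readable; the sketch assumes the file
// offset corresponds directly to the address.
fn peek<M: Read + Seek>(mem: &mut M, addr: u64, len: usize) -> io::Result<Vec<u8>> {
    mem.seek(SeekFrom::Start(addr))?;
    let mut buf = vec![0u8; len];
    mem.read_exact(&mut buf)?;
    Ok(buf)
}

// Write `data` starting at address `addr` of the traced process.
fn poke<M: Write + Seek>(mem: &mut M, addr: u64, data: &[u8]) -> io::Result<()> {
    mem.seek(SeekFrom::Start(addr))?;
    mem.write_all(data)
}
```

Because the helpers only depend on `Read`, `Write` and `Seek`, they can be exercised against an in-memory `std::io::Cursor` without a real tracee.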
240 | 
241 | ## Security
242 | 
243 | By default, a process should only be allowed to control a process that
244 | is owned by the current user and of which it is an ancestor,
245 | direct or indirect. The main motivation for allowing indirect
246 | subprocesses is so one can trace threads of a direct subprocess.
247 | 
248 | This restriction is lifted for processes owned by `root`, which can
249 | trace any process. In the future, a capability-like system could be
250 | put in place to allow specific executables to trace any process owned
251 | by the current user without having root access.
252 | 
253 | ## Example
254 | 
255 | ```rust
256 | let pid = unsafe { syscall::clone()? };
257 | if pid == 0 {
258 |     // This is the child
259 | 
260 |     // Wait until parent is ready to trace
261 |     syscall::kill(syscall::getpid()?, syscall::SIGSTOP)?;
262 | 
263 |     println!("Some prints here");
264 |     println!("Some other syscalls");
265 |     eprintln!("One can even print to STDERR");
266 |     // Do things here. `fexec` another process, maybe?
267 | } else {
268 |     // This is the parent
269 | 
270 |     // Wait for the child-initiated SIGSTOP to complete.
271 |     let mut status = 0;
272 |     syscall::waitpid(pid, &mut status, WUNTRACED)?;
273 | 
274 |     // ptrace attach: Stop the process using internal ptrace mechanism
275 |     // (not SIGSTOP!)
276 |     let mut trace = OpenOptions::new()
277 |         .read(true)
278 |         .write(true)
279 |         .truncate(true)
280 |         .open(&format!("proc:{}/trace", pid))?;
281 |     // obtain a read/write handle to the process registers
282 |     let mut regs = OpenOptions::new().read(true).write(true).open(&format!("proc:{}/regs/int", pid))?;
283 | 
284 |     // It is safe to schedule a continuation of child process, because
285 |     // it is still stopped by ptrace
286 |     syscall::kill(pid, SIGCONT)?;
287 | 
288 |     trace.write(syscall::PTRACE_STOP_PRE_SYSCALL | syscall::PTRACE_EVENT_CLONE)?;
289 |     // Mostly ignore event... usually you can get some interesting
290 |     // data from it
291 |     let mut event: PtraceEvent = PtraceEvent::default();
292 |     trace.read(&mut event)?;
293 |     while event.cause & syscall::PTRACE_EVENT_MASK != 0 {
294 |         // In reality, you'll actually want to handle this event, or
295 |         // it makes no sense to listen for it at all. This is just an
296 |         // example to show you how you can handle non-breakpoint
297 |         // events though.
298 |         trace.write(syscall::PTRACE_FLAG_WAIT)?;
299 |         trace.read(&mut event)?;
300 |     }
301 |     // This assertion is safe because if the process exits, the write
302 |     // call returns ESRCH
303 |     assert_eq!(event.cause, syscall::PTRACE_STOP_PRE_SYSCALL);
304 | 
305 |     let mut registers = syscall::IntRegisters::default();
306 |     regs.read(&mut registers)?;
307 | 
308 |     println!("System call: {}", registers.orig_rax);
309 |     println!("Replacing with exit!");
310 | 
311 |     registers.orig_rax = syscall::SYS_EXIT;
312 |     registers.rdi = 0;
313 | 
314 |     regs.write(&registers)?;
315 | 
316 |     // trace.write(syscall::PTRACE_STOP_POST_SYSCALL)?; // wait for the completion of the system call
317 | 
318 |     trace.write(syscall::PtraceFlags::empty())?; // don't set any stops, rather run the program to the end, which is like right now
319 |     syscall::waitpid(pid, &mut status, 0)?; // reap zombie process
320 | 
321 |     // trace file dropped here: process tracing detached and process
322 |     // implicitly resumed if it hadn't already been, y'know, killed
323 | }
324 | ```
325 | 
326 | # Drawbacks
327 | [drawbacks]: #drawbacks
328 | 
329 | There is no drawback to introducing more features, except for the mere
330 | size of the microkernel being increased. This is worth it as `ptrace`
331 | will open up a lot of different possibilities, maybe even so that we
332 | can eventually run Linux programs on Redox.
333 | 
334 | # Alternatives
335 | [alternatives]: #alternatives
336 | 
337 | One could, of course, use the Linux way of doing things. I think
338 | schemes provide excellent interfaces with a clear sense of ownership
339 | ("This ptrace invocation operates on this tracee, not anything else"),
340 | and I'm especially convinced of using them when that means we can
341 | leverage the existing, excellent, namespacing support ([see
342 | example](https://gitlab.redox-os.org/redox-os/contain/blob/6a1c070381f2c8b56c688c8cca454a818ee72520/src/main.rs#L21-33)).
343 | 
344 | An alternative to using a unified `proc:` scheme would be to, of
345 | course, split it up into one for ptrace + registers and one for
346 | writing memory. This was what the original RFC first suggested, before
347 | [@zen3ger](https://gitlab.redox-os.org/zen3ger) mentioned how Solaris
348 | implements a `ptrace(...)` function as a userspace library over their
349 | ProcFS.
350 | 
351 | There are lots of possible alternatives, one of which was implemented
352 | and tried out. However, out of the ones I've considered, this one
353 | should be the most scalable over time.
354 | 


--------------------------------------------------------------------------------
/text/0005-scheme-forward-fds.md:
--------------------------------------------------------------------------------
  1 | - Feature Name: scheme_forward_fds
  2 | - Start Date: 2021-06-12
  3 | - RFC PR: https://gitlab.redox-os.org/redox-os/rfcs/-/merge_requests/17
  4 | - Redox Issue: (leave this empty)
  5 | 
  6 | # Summary
  7 | [summary]: #summary
  8 | 
  9 | This feature allows schemes to forward file descriptors local _to the process
 10 | that is responsible for handling that scheme_. In other words, rather than
 11 | giving the kernel a word-size integer representing a scheme-local file
 12 | descriptor number, it can instead give the kernel a fully valid file descriptor
 13 | in its _process file descriptor namespace_, that might have originated from a
 14 | completely different scheme in the first place, thus _forwarding_ file
 15 | descriptors.
 16 | 
 17 | # Motivation
 18 | [motivation]: #motivation
 19 | 
 20 | Microkernels are intrinsically meant to be as minimal as possible, and Redox
 21 | schemes serve as a useful IPC primitive. The ability to compose schemes, on the
 22 | other hand, is limited by the latency delay of chaining scheme calls, if scheme
 23 | A needs to communicate with scheme B, and in turn, scheme C, etc. Sometimes,
 24 | the schemes in such chains need to access the inner data, but in many cases,
 25 | filtering e.g. directory entries and enforcing access checks is sufficient.
 26 | 
 27 | A good example is the `irq:` scheme. IRQ handling must be _fast_, and as
 28 | low-latency as possible. Ideally, the only thing before the actual code should
 29 | be to mask the interrupt and directly switch to the handler. However, the
 30 | current IRQ scheme also handles IRQ allocation for drivers, which adds
 31 | unnecessary kernel code, simply because adding a wrapper scheme would slow down
 32 | IRQ handling. This would also apply for hypothetical I/O port and MSR
 33 | scheme-based interfaces.
 34 | 
 35 | Additionally, the `proc:` scheme can probably, at some point in the future, be
 36 | divided into a userspace and kernel part, providing convenience APIs such as
 37 | `proc:PID/memory` while forwarding performance-critical APIs such as ptrace
 38 | pipes.
 39 | 
 40 | With the feature described by this RFC, resource allocation can be moved to a
 41 | userspace scheme, giving that wrapper scheme handler near-full access to
 42 | the kernel scheme. Once a client has opened a file, the client holds a file
 43 | descriptor that points directly to the kernel IRQ scheme.
 44 | 
 45 | The same logic holds for userspace. Disk partitions are currently managed
 46 | by the disk drivers, mainly to maintain flexibility and performance, as a
 47 | middleman scheme would add latency to every read or write call. And while it
 48 | may be reasonable to add this particular duplicate functionality to multiple
 49 | drivers, this feature would enable disk drivers to simply track allowed
 50 | ranges in each file descriptor, and then have partition daemons provide
 51 | forwarding schemes.
 52 | 
 53 | Most importantly, __a chroot tool implemented using schemes and namespaces,
 54 | would also become zero-cost for data, even if it may have to filter metadata
 55 | access, i.e. directory structures.__ Scheme file forwarding is not
 56 | _sufficient_, as some other interfaces such as `openat` may be required first,
 57 | but it is _necessary_ for allowing fast chroots.
 58 | 
 59 | # Detailed design
 60 | [design]: #detailed-design
 61 | 
 62 | On Redox, each file description is associated with a scheme ID and a
 63 | scheme-provided identifier, and is refcounted. Hence, schemes respond to
 64 | `SYS_OPEN` and `SYS_DUP` calls by returning that identifier, causing the kernel
 65 | to insert a new file descriptor, pointing to the newly-created file
 66 | description. A new file descriptor is thus _created_ in the process, with the
 67 | scheme being responsible for handling all subsequent system calls operating on
 68 | that file descriptor.
 69 | 
 70 | Alternatively, this new feature additionally allows schemes to _forward_ an
 71 | existing file description, as opposed to creating a new file description.
 72 | 
 73 | Schemes normally respond to kernel calls by reusing the same packet they received
 74 | from the kernel earlier, keeping all fields (even though every field except
 75 | `id` is ignored), and setting `a` to the return value of that scheme message.
 76 | The only exception is when triggering events, where the scheme instead sets the `id`
 77 | field to zero, and sets `a` to `SYS_FEVENT`, `b` to the scheme-provided number,
 78 | and `c` to the event flags.
 79 | 
 80 | It would be possible to add a new syscall number exclusively used as a scheme
 81 | response code, similar to `SYS_FEVENT` messages. However, since the response
 82 | needs to be associated with `packet.id`, which as a `u64` is not guaranteed to
 83 | fit within a `usize`, the current implementation introduces a new error code,
 84 | `ESKMSG`. Paired with `packet.b` = `SKMSG_FRETURNFD` and `packet.c` =
 85 | scheme-owned fd, the kernel will move that file descriptor out of the scheme's
 86 | file table, and then transparently store that file description into the
 87 | caller's file table, regardless of whether the file descriptor was created or
 88 | forwarded.
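The two kinds of response described above can be sketched roughly as follows. This is an illustration under stated assumptions: `Packet` is reduced to the fields named in the text, the numeric values of `ESKMSG` and `SKMSG_FRETURNFD` are placeholders, errors are assumed to be encoded as a negated error code in `a`, and the two helper functions are hypothetical names.

```rust
// Placeholder values for illustration only; the real constants are defined
// alongside the kernel ABI and may differ.
const ESKMSG: isize = 132;
const SKMSG_FRETURNFD: usize = 0;

// Simplified packet with only the fields discussed in the text.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Packet {
    id: u64,
    a: usize,
    b: usize,
    c: usize,
}

// Ordinary response: reuse the packet and put the scheme-local identifier
// in `a`, so the kernel creates a new file description.
fn respond_with_new_file(mut packet: Packet, scheme_local_id: usize) -> Packet {
    packet.a = scheme_local_id;
    packet
}

// Forwarding response: report ESKMSG (assumed here to be encoded as a
// negated error code in `a`) and name the scheme-owned fd in `c`.
fn respond_with_forwarded_fd(mut packet: Packet, scheme_owned_fd: usize) -> Packet {
    packet.a = (-ESKMSG) as usize;
    packet.b = SKMSG_FRETURNFD;
    packet.c = scheme_owned_fd;
    packet
}
```

In both cases `packet.id` is left untouched, which is what lets the kernel associate the response with the original request.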
 89 | 
 90 | # Drawbacks
 91 | [drawbacks]: #drawbacks
 92 | 
 93 | The trivial drawback is that it adds complexity to the kernel, although in this
 94 | case it is relatively minor. The question is rather whether we actually do want
 95 | the functionality of being able to forward file descriptors that previously
 96 | originated from different schemes.
 97 | 
 98 | # Alternatives
 99 | [alternatives]: #alternatives
100 | 
101 | The obvious alternative would be to simply allow schemes to communicate with
102 | the caller requesting a file descriptor from `SYS_OPEN` or `SYS_DUP`, via
103 | previously suggested "fd channels", such as `cable:` or `sendfd:`. On the other
104 | hand, this would require the client to implement different logic for this, thus
105 | becoming less flexible and most importantly no longer transparent.
106 | 
107 | That said, scheme file forwarding is not incompatible with such file descriptor
108 | channels, and when sending or receiving large sets of file descriptors, scheme
109 | file forwarding may incur unnecessary performance overhead, in comparison.
110 | 
111 | # Unresolved questions
112 | [unresolved]: #unresolved-questions
113 | 
114 | The primary unresolved question is what syscall interface schemes should use
115 | when forwarding. The alternative to the current implementation's "scheme-kernel
116 | messages" (`ESKMSG`, `SKMSG_*`, and the rest of the packet), would be to use an
117 | interface similar to `SYS_FEVENT`.
118 | 
119 | A minor question is whether forwarded files should be moved or cloned from the
120 | scheme's file table. The current implementation moves them, but the scheme can
121 | dup the file descriptor to keep it, if moved by the kernel, or respectively,
122 | close it later, if cloned by the kernel.
123 | 


--------------------------------------------------------------------------------
/text/0006-scheme-path.md:
--------------------------------------------------------------------------------
  1 | - Feature Name: scheme-path
  2 | - Start Date: 2024-01-17
  3 | - RFC PR: (leave this empty)
  4 | - Redox Issue: (leave this empty)
  5 | 
  6 | # Summary
  7 | [summary]: #summary
  8 | 
  9 | As discussed with jeremy_soller, bjorn3, jcake, rw_van, 4lDO2, et al.
 10 | 
 11 | To alleviate the difficulties created by having `scheme_name:` as the root of a scheme path, the proposed new format is `/scheme/scheme_name` where `scheme` is the literal word "scheme" and `scheme_name` is the name of the scheme.
 12 | 
 13 | There will be a transition period where `scheme_name:` will continue to be accepted as a scheme path.
 14 | 
 15 | # Motivation
 16 | [motivation]: #motivation
 17 | 
 18 | The current `scheme_name:` format creates multiple problems.
 19 | 
 20 | - Linux programs cannot detect `scheme_name:` as an absolute path and instead treat it as a relative path.
 21 | 
 22 | - Paths containing colon (`:`) are not properly parsed when they are part of e.g. `$PATH` or other colon-separated lists.
 23 | 
 24 | - Rust's `std::path` library does not easily integrate new `Prefix` variants. Adding a `Scheme` variant to `Prefix` causes some existing Rust programs and libraries to not compile due to missing `match` arms, and may create other problems. An OS-aware implementation of `std::path` is under consideration by the Rust team but it is not imminent.
 25 | 
 26 | # Detailed design
 27 | [design]: #detailed-design
 28 | 
 29 | ## Behavior
 30 | 
 31 | 1. The new format for schemes will be `/scheme/scheme_name` where `scheme` is the literal string "scheme" and `scheme_name` is the name of the scheme (resource or service).
 32 | 
 33 | 2. Slash (`/`) will be the only recognized path separator (after the transition is complete). Once the transition to the new format is entirely complete, `:` will be an allowed character in filenames.
 34 | 
 35 | 3. The `file` scheme will now **always** be the default scheme. If a path does not start with `/scheme` but it does start with `/`, it will have `/scheme/file` prefixed during canonicalization.
 36 | 
 37 | 4. Relative paths are now allowed to back up out of a scheme. Previously, `scheme_name:/..` would resolve to `scheme_name:`. Now, `/scheme/scheme_name/..` resolves to `/scheme` and `/scheme/scheme_name/../..` resolves to `/` which is equivalent to `/scheme/file`. `/..` resolves to `/` as on Unix.
 38 | 
 39 | 5. Scheme providers currently receive paths for `open` requests with the scheme component stripped off by the kernel, with the path argument to the `open` call assumed as an absolute path within the receiving scheme. No change is required here.
 40 | 
 41 | **NOTE:** The creation/mounting of new schemes is documented in a separate RFC.
 42 | 
 43 | 6. To ease the transition to the new format, functionality that parses `scheme_name:`
 44 | should be guarded with two feature flags, `scheme_fmt_warn` and `scheme_fmt_compat`.
 45 | 
 46 |     - If `scheme_fmt_warn` is enabled, an error message is printed to stderr or logged to the console when the `scheme_name:` format is used.
 47 | 
 48 |     - If `scheme_fmt_compat` is enabled, the `scheme_name:` format continues to work as previously.
 49 | 
 50 |     - Disabling `scheme_fmt_compat` will cause the `scheme_name:` format to be treated as a regular file name or as a path relative to the current working directory.
 51 | 
 52 |     - Initially, `scheme_fmt_compat` will be enabled by default and `scheme_fmt_warn` will be disabled.
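Rules 3 and 4 can be illustrated with a small lexical canonicalizer for absolute, new-format paths. This is a sketch only, not the `redox-path` or `relibc` implementation: the function name `canonicalize_new_format` is hypothetical, symbolic links and the legacy format are ignored, and applying the default-scheme rule after resolving `..` is an assumption.

```rust
// Lexically canonicalize an absolute path in the new format.
// Returns None for relative paths, which need a CWD that this sketch omits.
fn canonicalize_new_format(path: &str) -> Option<String> {
    if !path.starts_with('/') {
        return None;
    }
    let mut parts: Vec<&str> = Vec::new();
    for component in path.split('/') {
        match component {
            // Skip empty components (repeated slashes) and `.`.
            "" | "." => {}
            // `..` may back out of a scheme; at the root it is a no-op.
            ".." => {
                parts.pop();
            }
            other => parts.push(other),
        }
    }
    // Rule 3: anything not under `/scheme` belongs to the `file` scheme.
    if parts.first() != Some(&"scheme") {
        parts.insert(0, "file");
        parts.insert(0, "scheme");
    }
    Some(format!("/{}", parts.join("/")))
}
```

With this sketch, `/bin/ls` becomes `/scheme/file/bin/ls`, `/scheme/tcp/..` becomes `/scheme`, and both `/scheme/tcp/../..` and `/..` end up at `/scheme/file`, the canonical spelling of `/`.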
 53 | 
 54 | In the descriptions below,
 55 | 
 56 | - items marked **DEPENDENCY** are required to be implemented before some other change can be completed.
 57 | 
 58 | - items marked **USER VISIBLE** are higher priority as the user will see filenames using the old format.
 59 | 
 60 | ## Disk Partitions
 61 | 
 62 | `RedoxFS` is the scheme provider for the `file:` scheme, which will be renamed `/scheme/file`. RedoxFS handles one disk partition per scheme. In a scenario where multiple disk partitions are to be mounted,
 63 | 
 64 | 1. There will be a "root partition" managed by RedoxFS and named `/scheme/file`. Its content will normally be referred to starting from `/`, i.e. `/scheme/file/bin` will be referred to as `/bin`.
 65 | 
 66 | 2. Additional partitions will be managed by other instances of RedoxFS, (or a single instance managing multiple schemes) with their own scheme names, e.g. a scheme name of "file.home" would appear as `/scheme/file.home` and identify the RedoxFS instance that manages the "home" partition.
 67 | 
 68 | 3. There will be a symbolic link in the root partition that places `/scheme/file.home` at its "mount point". e.g. `/home` -> `/scheme/file.home`. This will allow a `user` folder on the `home` partition to have an apparent name of `/home/user` and a full name of `/scheme/file.home/user`.
 69 | 
 70 | Further details on the topic of disk partitions, mount points, `realpath` and `fpath` are outside the scope of this RFC.
 71 | 
 72 | ## Code Changes
 73 | 
 74 | ### redox-path
 75 | 
 76 | A new crate for handling of new and legacy paths is now available. `redox-path` includes `canonicalize_with_cwd`, which canonicalizes paths that use the new format. Legacy-format paths are not currently canonicalized as some schemes use paths that must be formatted very precisely.
 77 | 
 78 | ### PATH variable
 79 | 
 80 | In the future, the PATH environment variable will be converted to colon-separated format. In the short term, it is recommended that the PATH be limited to `/usr/bin` and all commands be copied or linked in that directory.
 81 | 
 82 | ### Kernel Scheme Dispatch
 83 | 
 84 | The kernel "Scheme Dispatch" functionality must be changed first, including compatibility feature guards. Once that is done, all other changes can be done as time permits. See also the RFC regarding the "Namespace" scheme.
 85 | 
 86 | #### Current implementation
 87 | 
 88 | [syscall::fs::open](https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/syscall/fs.rs?ref_type=heads#L44) is responsible for dispatching `open` calls to the appropriate scheme. It assumes paths are already in canonicalized form. 
 89 | 
 90 | It currently obtains the scheme name by splitting the name at the first `:`. It dispatches to the appropriate scheme by looking up the scheme in the current namespace. `:scheme` is naturally parsed into an empty scheme name `""` with a path of `scheme`. This is interpreted as a request to the `RootScheme` with a path argument of `scheme`.
 91 | 
 92 | This functionality will continue to be provided until we are ready to delete it, but with feature guards described below.
 93 | 
 94 | The path format for dispatch to the `RootScheme` is discussed in another RFC.
 95 | 
 96 | #### Changes required
 97 | 
 98 | 1. `syscall::fs::open` will now also need to parse the new format by stripping the `/scheme/` literal and taking the string up to the next `/` as the scheme name. A path that does not begin with `/scheme/scheme_name` (or is not in the previous format, when `scheme_fmt_compat` is enabled) is considered an error.
 99 | 
100 | 2. The `scheme_fmt_warn` and `scheme_fmt_compat` feature guards should be implemented. Logging to the console will be done for old format scheme references.
101 | 
102 | **DEPENDENCY** This work must be completed before any other conversion work can proceed.
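The parsing step that `syscall::fs::open` needs could look roughly like the sketch below. It is not kernel code: `split_scheme` and its `compat` parameter (standing in for the `scheme_fmt_compat` feature) are hypothetical, and stripping the slash that follows the scheme name, as well as rejecting an empty name in the new format, are assumptions.

```rust
// Split a canonical path into (scheme name, path within the scheme).
// `compat` stands in for the `scheme_fmt_compat` feature guard.
fn split_scheme(path: &str, compat: bool) -> Option<(&str, &str)> {
    // New format: `/scheme/<name>[/<rest>]`.
    if let Some(rest) = path.strip_prefix("/scheme/") {
        let (name, reference) = match rest.find('/') {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, ""),
        };
        // Assumption: an empty name is rejected here; how the root scheme
        // is addressed is left to the other RFC.
        if name.is_empty() {
            return None;
        }
        return Some((name, reference));
    }
    // Legacy format: split at the first `:`; an empty name is the root scheme.
    if compat && !path.starts_with('/') {
        if let Some((name, reference)) = path.split_once(':') {
            return Some((name, reference));
        }
    }
    // Anything else is an error for the dispatcher.
    None
}
```

A real implementation would also emit the `scheme_fmt_warn` log message on the legacy branch.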
103 | 
104 | ### Rust's std::path
105 | 
106 | Rust's `std::path` has some junk code in it due to *partial* rejection of past Redox pull requests. All the Redox-specific code should now be removed from `std::path` as Redox will now work with Linux-format paths. It is proposed that this work should be done once we are confident in the new `/scheme/scheme_name` format, but as soon as possible after that.
107 | 
108 |   - [here](https://github.com/rust-lang/rust/blob/master/library/std/src/path.rs#L303)
109 | 
110 |   - [here](https://github.com/rust-lang/rust/blob/master/library/std/src/path.rs#L2180)
111 | 
112 |   - [here](https://github.com/rust-lang/rust/blob/master/library/std/src/path.rs#L2673)
113 | 
114 |   - Possibly others
115 | 
116 | ### Redox's std::path
117 | 
118 | All Redox-specific code in Redox's fork of Rust's `std::path` should be removed. Similar to the above changes, plus removal of the `Scheme` variant of Prefix.
119 | 
120 | ### Camino
121 | 
122 | There is a [PR pending](https://github.com/camino-rs/camino/pull/88) for Camino, to adopt the `Scheme` Prefix variant. It should be closed.
123 | 
124 | ### relibc Canonicalize
125 | 
126 | #### Current functionality
127 | 
128 | Currently, paths that use libc `open` are canonicalized to create a full path. The canonicalization uses the current working directory `CWD` as an additional information source during canonicalization.
129 | 
130 | - If the path given to `open` starts with `scheme_name:`, it is taken as an absolute path and is used unchanged.
131 | 
132 | - If the path starts with `/`, it is taken as absolute within the scheme of `CWD` (typically `file:`), and the scheme portion of `CWD` is prepended.
133 | 
134 | - If the path does not start with `scheme_name:` or `/`, it is prefixed with `CWD`.
135 | 
136 | #### Changes
137 | 
138 | 1. The [canonicalize_with_cwd](https://gitlab.redox-os.org/redox-os/relibc/-/blob/master/src/platform/redox/path.rs?ref_type=heads#L16) function in `relibc` needs to be modified to accept `/scheme/scheme_name/path` as a new format, for specifying both the path and the CWD. This function will convert all paths to the new format prior to `syscall::open`.
139 | 
140 | 2. This function will need to be changed to use `/scheme/file` as the scheme for an absolute path that does not contain a scheme.
141 | 
142 | 3. The `scheme_fmt_warn` and `scheme_fmt_compat` feature guards should be implemented, with old format scheme references being warned on `stderr`.
143 | 
144 | 4. An `assert` that the path must not follow the old format should be included when `scheme_fmt_compat` is disabled, at least for some initial period. This will trigger an abort, and when used with `RUST_BACKTRACE=full` can help determine the source of the problem.
145 | 
146 | 5. After all old-format code has been removed, the `assert` should be removed, and paths starting with `name:` will be treated as allowed relative paths and handled normally.
147 | 
148 | ### realpath
149 | 
150 | `realpath` is a `libc` function that on Linux takes a file descriptor and returns an absolute path that can be used to open the same file. This path will have all symbolic links resolved.
151 | 
152 | See [Unresolved questions](#unresolved-questions) regarding `realpath` issues.
153 | 
154 | On Redox, `realpath` uses the scheme service `fpath` to determine the path. `fpath` is expected to return a full pathname including the scheme. Current `fpath` implementations return paths using `scheme_name:path` format.
155 | 
156 | `realpath` should be modified as follows:
157 | 
158 | 1. On return from `fpath`, `realpath` will strip `/scheme/file` from paths that contain it, so a path such as `/scheme/file/home` will be reported as `/home`.
159 | 
160 | 2. The `scheme_fmt_warn` feature guard will enable `realpath` to check if the scheme format is `scheme_name:` and report to `stderr` if it is.
161 | 
162 | 3. The `scheme_fmt_compat` feature guard will enable `realpath` to replace `scheme_name:` with `/scheme/scheme_name/` (with appropriate bounds checking and overlapping copy). `file:` will be stripped from paths that contain it, replacing it with a leading `/` if needed.
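The post-processing of the `fpath` result described in steps 1 and 3 might be sketched as below. This is an illustration on owned strings rather than the in-place C buffer handling `realpath` would actually need; `normalize_fpath` and its `compat` parameter (standing in for `scheme_fmt_compat`) are hypothetical names, and the warning from step 2 is omitted.

```rust
// Turn a path returned by `fpath` into the form `realpath` should report.
fn normalize_fpath(path: &str, compat: bool) -> String {
    // Step 3: rewrite legacy `scheme_name:rest` to `/scheme/scheme_name/rest`.
    let converted: String = match path.split_once(':') {
        Some((scheme, rest)) if compat && !path.starts_with('/') => {
            format!("/scheme/{}/{}", scheme, rest.trim_start_matches('/'))
        }
        _ => path.to_string(),
    };
    // Step 1: strip `/scheme/file`, but only when it is a whole component,
    // so that a scheme such as `file.home` is left untouched.
    if let Some(rest) = converted.strip_prefix("/scheme/file") {
        if rest.is_empty() {
            return "/".to_string();
        }
        if rest.starts_with('/') {
            return rest.to_string();
        }
    }
    converted
}
```

For example, both `/scheme/file/home` and (with `compat` enabled) `file:/home` are reported as `/home`, while `/scheme/file.home/user` is returned unchanged.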
163 | 
164 | ### fpath implementations
165 | 
166 | Several scheme providers implement `fpath`, where the scheme-relative path is calculated for a given file descriptor, and the scheme prefix is inserted by the scheme provider.
167 | 
168 | 1. `RedoxFS` is the main provider of `fpath` and should be updated as soon as possible to return a path in the new format.
169 | 
170 | 2. We will need to do a survey of schemes to determine which ones provide `fpath`.
171 | 
172 | 3. We will need to work through a prioritized list of `fpath` implementations to update them.
173 | 
174 | ### redox-scheme
175 | 
176 | `redox-scheme` is the current best practice for creating user-space schemes. It should be updated to use the new scheme format. Scheme creation is discussed in the "Namespace Scheme" RFC.
177 | 
178 | **DEPENDENCY** - Is `redox-scheme` ready for all drivers to be migrated?
179 | 
180 | ### redox-event
181 | 
182 | `redox-event` is the current best practice for using an event-based interface.
183 | 
184 | 1. It should be updated to use the new scheme format.
185 | 
186 | 2. A timer service should be added to `redox-event`, as many event subscribers use timers/timeouts directly, and in fact the use of timers is often the motivation for using the event scheme.
187 | 
188 | 3. `epoll` requires the ability to gather all events that are available. This implies a need for a non-blocking check for events, e.g. `is_ready` or `maybe_next`.
189 | 
190 | **DEPENDENCY** - A timer service should be implemented as soon as possible. Is `redox-event` ready for all (most) event users to be migrated?
191 | 
192 | **DEPENDENCY** - A non-blocking check for events is needed by `epoll`.
193 | 
194 | ### Scheme Clients
195 | 
196 | `event:`, `time:` and `file:` schemes are referenced in many places throughout the Redox code. They will all need to be modified.
197 | 1. Where `event:` and `time:` are referenced, consider using the `redox-event` crate (depends on the [timer service](#redox-event) being added to `redox-event`).
198 | 
199 | 2. Where `file:` is used, should we simply delete the scheme reference (since `/scheme/file` is now always the default), or should we convert it to `/scheme/file/`?
200 | 
201 | ### epoll
202 | 
203 | [epoll](https://gitlab.redox-os.org/redox-os/relibc/-/blob/master/src/platform/redox/epoll.rs?ref_type=heads#L56) in `relibc` uses the `event:` and `time:` schemes directly. It should be converted to use `redox-event` if possible. A non-blocking check for events may be required. If this is not feasible, `epoll` should be modified to use the new scheme format.
204 | 
205 | ### libc: Scheme
206 | 
207 | The `libc:` scheme is implemented in `relibc` as a way to resolve paths that depend on the process context, e.g. `/dev/tty`. `/dev/tty` is a symbolic link in the `file:` scheme that refers to `libc:tty`. The `libc:` scheme parses the full path used for `open` (`"libc:tty"` in this case). It needs to be updated to use the new format. The primary use of the `libc` scheme is via symbolic links in the [filesystem configs](#filesystem-configs).
208 | 
209 | On first glance, it looks like the `libc:` scheme can be updated with a [single change](https://gitlab.redox-os.org/redox-os/relibc/-/blob/master/src/platform/redox/libcscheme.rs?ref_type=heads#L6) to a constant. However, further investigation should be done.
210 | 
211 | **DEPENDENCY** This change should be done at the same time as updating the filesystem configs. Not many Redox applications use this functionality, so it is not urgent unless we are porting Linux TUI apps.
212 | 
213 | ### RedoxFS
214 | 
215 | 1. RedoxFS has its own [canonicalize](https://gitlab.redox-os.org/redox-os/redoxfs/-/blob/master/src/mount/redox/scheme.rs?ref_type=heads#L164) functionality that needs to be updated.
216 | 
217 | 2. Old format symbolic links are supported. Because old format paths are not canonicalized, RedoxFS needs to convert old format links to the new format when resolving the links.
218 | 
219 | 3. [fpath](https://gitlab.redox-os.org/redox-os/redoxfs/-/blob/master/src/mount/redox/scheme.rs?ref_type=heads#L634) will need to be revised.
220 | 
221 | ### Contain
222 | 
223 | Contain makes extensive use of filename parsing, canonicalization and scheme references. The work to be done is not listed here as it is extensive.
224 | 
225 | **DEPENDENCY** The `desktop-contain` filesystem config should be updated at the same time as the updates to Contain.
226 | 
227 | ### Ion
228 | 
229 | Ion has some code that strips or adds the `file:` prefix.
230 | 
231 | **USER VISIBLE** Although `file:` is removed, other scheme-prefixed paths are displayed to the user. This should be minimized by updates to `realpath`.
232 | 
233 | ### Bash, Dash, Nushell, etc.
234 | 
235 | Shells that have Redox forks likely have `file:` prefix stripping to enable `glob` pattern expansion. This will need to be removed.
236 | 
237 | ### Other libraries and crates
238 | 
239 | Any library or crate that has a Redox fork likely has some scheme-related code. They will need to be examined.
240 | 
241 | ### Filesystem Configs
242 | 
243 | All filesystem configs (e.g. `desktop.toml`) need to be updated to the new format.
244 | 
245 | 1. Symbolic links e.g. `/dev/tty` will need to be in the new format.
246 | 
247 | 2. `desktop-contain.toml` will need to be revised to the new format.
248 | 
249 | ## Documentation Changes
250 | 
251 | ### Book
252 | 
253 | All documentation about schemes will need to be updated. There are several pages that discuss schemes.
254 | 
255 | ### README files
256 | 
257 | A scan of the README files for each repo will need to be done, with updates as needed.
258 | 
259 | # Drawbacks
260 | [drawbacks]: #drawbacks
261 | 
262 | 1. Amount of Rework
263 | 2. Moving away from URI format
264 | 3. History of pushing for URI format to be accepted by the Rust community
265 | 
266 | # Alternatives
267 | [alternatives]: #alternatives
268 | 
269 | ## URI (Status Quo)
270 | 
271 | The status quo of URI-style `scheme_name:` paths seems to have reached its limit. Although it is a recognized path format in POSIX, it is not used for filesystem references. In practice, web applications and services that process URIs convert them into simple Unix paths.
272 | 
273 | Our use of this format is now impeding the porting of Linux applications and FOSS Rust applications.
274 | 
275 | ## Plan 9 Paths
276 | 
277 | Plan 9 paths are similar to the proposed Redox scheme format, except that Plan 9 services are mostly present at the root of the filesystem, e.g. `/tcp`. Redox will have all services (schemes) logically under the `/scheme/` folder. This will help keep the filesystem root clean and allow for more *nix-like filenames.
278 | 
279 | ## Other Path Formats
280 | 
281 | Using some Windows-compatible path format could alleviate some of the problems of having scheme-based names. [UNC](https://en.wikipedia.org/wiki/Path_(computing)#Universal_Naming_Convention) and the DeviceNS formats are supported by Rust `std::path`. However, there doesn't seem to be a benefit over using `/scheme/`.
282 | 
283 | POSIX allows that paths starting with `//` are "implementation defined" but other points in the POSIX specification state that a prefix of more than one slash is to be treated as a single slash.
284 | 
285 | # Unresolved questions
286 | [unresolved]: #unresolved-questions
287 | 
288 | 1. On Redox, the `fpath` scheme service is used as the mechanism to obtain a path for a file descriptor. However, it produces results that are not guaranteed to be correct. A future implementation of an `fpath`-like service will address the problems. Resolving this issue is outside the scope of this RFC.


--------------------------------------------------------------------------------
/text/0007-base-system-repo.md:
--------------------------------------------------------------------------------
 1 | - Feature Name: base-system-repo
 2 | - Start Date: 2024-12-29
 3 | - RFC PR: (leave this empty)
 4 | - Redox Issue: (leave this empty)
 5 | 
 6 | # Summary
 7 | [summary]: #summary
 8 | 
 9 | Merge the repos forming the base system into a single repo.
10 | 
11 | # Motivation
12 | [motivation]: #motivation
13 | 
14 | While we will likely be able to stabilize the userspace ABI relatively soon through dynamic linking of relibc and libredox, the syscall interface as well as the interface of system services between each other and with relibc is likely to take much longer to stabilize if it ever gets stabilized. Many internal improvements currently require merge requests across multiple repos that need to be merged at the same time which makes such changes harder to do. For this reason some repos contain programs that don't actually quite fit in the repo but are only there because it makes changes easier. For example the driver repo contains inputd, fbbootlogd and fbcond, none of which are actually drivers, but all of them are somewhat coupled with the graphics drivers. And having crates like redox-scheme and redox-log be included through crates.io due to being in a separate repo makes changes to them harder too. Merging all programs in the base system into a single repo will make it easier to make atomic changes. It is also currently not safe for users to update individual packages that are part of the base system.
15 | 
16 | # Detailed design
17 | [design]: #detailed-design
18 | 
19 | The following repos will be merged into a single "base" repo:
20 | 
21 | * audiod
22 | * contain
23 | * drivers
24 | * event
25 | * init
26 | * initfs
27 | * ipcd
28 | * logd
29 | * netstack
30 | * ptyd
31 | * ramfs
32 | * randd
33 | * redox-log
34 | * redox-scheme
35 | * zerod
36 | 
37 | The following repos will **not** be merged into the "base" repo:
38 | 
39 | * binutils
40 | * bootloader (the bootloader interface rarely changes)
41 | * coreutils
42 | * dash
43 | * findutils
44 | * installer
45 | * ion
46 | * libredox (this is supposed to be a stable API in the future)
47 | * pkgar
48 | * pkgutils
49 | * redoxerd
50 | * uutils
51 | * pretty much everything outside of the core category in the cookbook
52 | 
53 | For the actual merge, I propose making a commit in each repo that moves its entire content to a subdirectory, and then using git merge to combine all the repos into a new repo. This preserves the full history of all repos as well as git blame. Afterwards, all issues will have to be transferred to the new repo.
54 | 
55 | After merging the repos, initially all recipes can be updated to use the `source.same_as` functionality that drivers-initfs already uses to have a single checkout for the base repo across all recipes and then build the respective subdirectory of the base repo. Once that is done, other MRs can start getting merged again as usual.
56 | 
57 | At a later point we can start merging recipes for base system components together and adapt the build step as appropriate. This will also enable sharing compiled dependencies between components if we put them in the same cargo workspace. In the end we probably want to end up with either a single base package, or a base package and a base-desktop package where the latter would contain the audio and graphics subsystems. Alternatively, we could end up with a base-server and a base-desktop package which contain all components that overlap between both configurations, to ensure the base system is atomically updated.
58 | 
59 | # Drawbacks
60 | [drawbacks]: #drawbacks
61 | 
62 | It is a non-trivial amount of work.
63 | 
64 | # Alternatives
65 | [alternatives]: #alternatives
66 | 
67 | What other designs have been considered? What is the impact of not doing this?
68 | 
69 | # Unresolved questions
70 | [unresolved]: #unresolved-questions
71 | 
72 | Should the following repos be merged into the "base" repo? I think they should, but it might be less disruptive to keep them in separate repos at least for the time being.
73 | 
74 | * bootstrap
75 | * escalated
76 | * kernel
77 | * relibc
78 | * redoxfs
79 | * syscall
80 | * orbital (but not the gui apps themselves)
81 | 
82 | @4lDO2 prefers putting those in submodules instead for another repo.
83 | 
84 | ---
85 | 
86 | Should we split the base package and if so should we split it into base and base-desktop or base-server and base-desktop?
87 | 


--------------------------------------------------------------------------------
/text/0008-userspace-signals.md:
--------------------------------------------------------------------------------
  1 | - Feature Name: userspace_signals
  2 | - Start Date: 2024-02-16
  3 | - RFC PR: https://gitlab.redox-os.org/redox-os/rfcs/-/merge_requests/19
  4 | - Redox Issue: https://gitlab.redox-os.org/redox-os/kernel/-/issues/113
  5 | 
  6 | # Summary
  7 | [summary]: #summary
  8 | 
  9 | Most of Redox's POSIX signal handling implementation can be moved to userspace in a way that, in particular, allows changing the signal mask without any syscalls.
 10 | Sending signals would still be done by the kernel, with regular mutex synchronization, or later, a userspace process manager.
 11 | 
 12 | # Motivation
 13 | [motivation]: #motivation
 14 | 
 15 | Signals are the userspace equivalent of hardware interrupts -- they interrupt what was previously running, use the same stack, and are mostly maskable.
 16 | POSIX requires sigprocmask, which most POSIX systems implement in the kernel, using a sigprocmask syscall.
 17 | 
 18 | However, as Redox is moving more and more of previous kernel functionality to redox-rt, it sometimes needs critical sections where signals must not be delivered.
 19 | This is currently the case for _open(3)_, where it would be useful not to need to wrap each open call in two sigprocmask syscalls.
 20 | The same issue will become more significant in the future, should redox-rt for example emulate POSIX file descriptors.
 21 | 
 22 | Moving the signal implementation to userspace will eliminate the need for the `sigprocmask`, `sigaction`, and `sigreturn` syscalls.
 23 | The redox-rt counterparts will also likely be faster, when no longer syscall-based.
 24 | However, a `kill`/`sigqueue` syscall, or an equivalent IPC call to a process manager, will still need to exist.
 25 | 
 26 | # Detailed design
 27 | [design]: #detailed-design
 28 | 
 29 | ## Proc scheme API
 30 | 
 31 | The kernel would remove `/<proc>/sigstack` and change `/<proc>/sighandler`, which would provide write-only access to the following struct:
 32 | 
 33 | ```rust
 34 | #[repr(C)]
 35 | struct SigEntry {
 36 |     user_handler: usize,
 37 |     excp_handler: usize,
 38 |     thread_ctl_region: usize,
 39 |     proc_ctl_region: usize,
 40 | }
 41 | ```
 42 | 
 43 | The `user_handler` and `excp_handler` fields are function pointers to the signal trampoline and CPU exception handler, respectively.
 44 | The `thread_ctl_region` and `proc_ctl_region` fields point to a control structure defined below, for thread vs process granularity.
 45 | Setting `user_handler` to zero disables user-handled signals completely for the thread.
 46 | The `excp_handler` field is however optional, and by default, CPU exceptions will result in core dumps unless explicitly handled.
 47 | The thread and process control region is defined as follows:
 48 | 
 49 | ```rust
 50 | #[repr(C)]
 51 | struct SigCtlRegion {
 52 |     // Consists of two words, one for standard and one for realtime signals.
 53 |     // The low 32 bits are the pending set, whereas the high bits are the allowset.
 54 |     ctl: [AtomicU64; 2],
 55 | 
 56 |     local_ctl: SigatomicU64, // accessed using Relaxed and compiler barriers
 57 | 
 58 |     old_ip: usize,              // eip/rip/pc
 59 |     old_archdep_reg: usize,     // eflags/rflags/x0
 60 | }
 61 | 
 62 | const LOCAL_CTL_INHIBIT_DELIVERY_BIT: u64 = 1;
 63 | 
 64 | #[repr(C)]
 65 | struct ProcCtlRegion {
 66 |     pending: AtomicU64,
 67 |     actions: [RawAction; 64],
 68 |     q: [RtSig; 32],
 69 |     qhead: AtomicU8,
 70 |     qtail: AtomicU8,
 71 | }
 72 | 
 73 | #[repr(C)]
 74 | struct RawAction {
 75 |     first: AtomicU64,
 76 |     user_data: AtomicU64,
 77 | }
 78 | 
 79 | #[repr(C)]
 80 | struct RtSig {
 81 |     signo: usize, // all bits except 6:0 are reserved
 82 |     sigval: usize,
 83 | }
 84 | ```
 85 | 
 86 | The signal control regions, for both process and thread, must each be contained within a single page, and must be 16-byte aligned.
 87 | Userspace is encouraged to reuse the existing TCB page.
 88 | The ctl field consists of two _signal groups_, namely the standard and realtime signals (starting at 33).
 89 | Each such `AtomicU64` is divided into a lower _pending set_ and upper _allowset_ half.
 90 | 
 91 | Since signal 0 does not exist, this bitset is zero-based.
 92 | The `sigprocmask` and `pthread_sigmask` functions, as defined by POSIX, can only modify the current thread's mask.
 93 | 
 94 | ## Kernel implementation of kill/sigqueue
 95 | 
 96 | The kill and/or sigqueue syscalls will still use mutex-based synchronization, with other kernel hardware threads.
 97 | This implies the only lock-free synchronization is check-mask-then-deliver and unmask-then-check-pending, where userspace synchronizes with the kernel, possibly on another hardware thread.
 98 | 
 99 | The kernel will set the corresponding pending bit, and then read the pending and allowset bits simultaneously; the logical AND of the two is the set of deliverable signals.
100 | If nonempty, the kernel will unblock the thread and set an internal flag indicating that a signal is incoming.
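
The send-side check described above can be sketched as follows. This is a minimal illustration and not the kernel's code: the helper name `set_pending_and_check` and the use of a single `fetch_or` to both set the bit and observe the word are assumptions, as is the convention that bit 0 of a group corresponds to the first signal of that group.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Low 32 bits of a group's control word: the pending set.
const PENDING_MASK: u64 = 0xFFFF_FFFF;

/// Hypothetical helper: mark `bit` (0..32) as pending in one signal group's
/// control word and return the set of deliverable signals for that group.
/// A single atomic read-modify-write yields a consistent snapshot of both
/// the pending set (low half) and the allowset (high half).
fn set_pending_and_check(word: &AtomicU64, bit: u32) -> u32 {
    assert!(bit < 32, "each group holds at most 32 signals");
    let bit_mask = 1u64 << bit;
    // SeqCst is used here because the RFC assumes SeqCst for now.
    let prev = word.fetch_or(bit_mask, Ordering::SeqCst);
    let now = prev | bit_mask;
    let pending = (now & PENDING_MASK) as u32;
    let allowset = (now >> 32) as u32;
    // Deliverable signals are those both pending and allowed.
    pending & allowset
}
```

If the returned set is nonempty, the sender would go on to unblock the target thread; if it is empty, the signal simply stays pending until the thread changes its allowset.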
101 | 
102 | Sending signals to a process, rather than to a thread, is done by first setting the bit in the process-wide pending set, followed by linearly searching the TCBs for a thread that has not blocked that signal.
103 | A subsequent no-op write to that thread's `ctl` with Release ordering, should synchronize that earlier write to the process-wide set, when the trampoline later reads its thread-specific `ctl` (with Acquire ordering) followed by reading the process-wide mask, and deciding which signal to deliver.
104 | 
105 | ### Delivery
106 | 
107 | When the kernel delivers a signal, it unblocks the thread, potentially sending an IPI, and when the context is switched to, it has exclusive access to the saved registers.
108 | The instruction and stack pointers, as well as some miscellaneous registers, are saved (using nonatomic accesses) to the respective fields of the thread control region.
109 | It will also set the _inhibit flag_.
110 | 
111 | The _inhibit flag_ allows temporarily preventing the kernel from jumping the userspace context to the signal trampoline, without affecting how threads are awoken etc.
112 | This flag allows async-signal-safe functions to easily and efficiently disable signals during short critical sections.
113 | 
114 | #### Trampoline
115 | 
116 | The `user_handler` field points to the _signal trampoline_.
117 | The kernel will save a few registers, some of which can be used as scratch registers.
118 | The trampoline will need to calculate the new stack pointer, taking into account the potential alternate signal stack (`sigaltstack`).
119 | On x86_64 this can be found [here](https://gitlab.redox-os.org/redox-os/relibc/-/blob/44f148ad6c214551bb0fddf0eecc9801558f25b9/redox-rt/src/arch/x86_64.rs#L174).
120 | 
121 | ## Fork, exec, pthread_create
122 | 
123 | POSIX requires fork to preserve all sigactions and the sigprocmask, which would likely be trivial considering the shared address space, as the proc scheme struct can simply be reapplied.
124 | 
125 | Exec must preserve the procmask, and remember which signals were ignored, but otherwise reset all nonignored sigactions.
126 | This information is passed in AT_SIGPROCMASK_{LO,HI}/AT_SIGIGNMASK_{LO,HI} with both lo and hi variants on 32-bit platforms.
127 | 
128 | `pthread_create` requires the pending set to start as empty, and the mask is inherited from the 'parent' thread.
129 | 
130 | ## SIGCONT, SIGSTOP(/SIGTSTP,SIGTTIN,SIGTTOU), SIGKILL
131 | 
132 | POSIX states that SIGKILL and SIGSTOP are not maskable, and cannot be handled in userspace or ignored.
133 | Luckily, this frees up 4 additional bits.
134 | The "SIGKILL masked", "SIGKILL pending", and "SIGSTOP masked" bits instead indicate whether the action for SIGTSTP, SIGTTIN, and SIGTTOU, respectively, is equivalent to the hardcoded SIGSTOP action.
135 | If they are unset, they can be ignored or handled, and masked, like the other signals.
136 | 
137 | The SIGCONT signal cannot be ignored either, in the sense that sending a SIGCONT will always continue the target process, but userspace can choose whether or not it will be caught.
138 | The pending and masked bits for SIGCONT will thus have the same behavior as regular signals, except `kill` will unconditionally transition the thread from _stopped_ to _blocked_ first.
139 | 
140 | POSIX requires the generation of SIGCONT to discard all pending stop signals, and vice versa.
141 | Since the `kill` implementation is mutex-synchronized, this should be relatively easy to synchronize.
142 | 
143 | ## Realtime signals
144 | 
145 | POSIX specifies that the conventional signals (typically 1-31) should be implemented as a set; that is, multiple undelivered signals of the same number should be merged into one.
146 | Realtime signals must instead be implemented as a queue, allowing multiple independent signals of the same number.
147 | Realtime signals must also be able to provide a value, either an `int` or a pointer.
148 | 
149 | This would be implemented using the queue array (the `q` field) in the signal control region.
150 | The `qhead` and `qtail` atomic fields together divide the queue array into a consumer-owned and producer-owned half.
151 | The consumer half shall thus be read nonatomically by the thread, and the producer half written nonatomically by the kernel.
152 | The `qtail` field shall be written only by the producer (kernel), and the `qhead` field by the consumer (thread).
153 | 
154 | Since the consumer half is exclusively owned by the thread, it can dynamically update the pending bits accordingly based on which nonmasked realtime signals are present in the queue.
155 | The kernel will set the pending bits as usual when sending realtime signals, which for synchronization reasons, must be done _after_ the actual entry is enqueued.
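
The head/tail ownership rule can be illustrated with a small ring buffer. This is a simplified single-threaded sketch and not the proposed implementation: the `RtQueue` type, the free-running wrapping `u8` counters, and the strict FIFO dequeue are all assumptions. A real shared-memory version would need interior mutability for the slots, and the consumer would skip over masked signals instead of always taking the oldest entry.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const QUEUE_LEN: usize = 32;

#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct RtSig {
    signo: usize,
    sigval: usize,
}

/// Hypothetical stand-in for the `q`/`qhead`/`qtail` fields.
struct RtQueue {
    q: [RtSig; QUEUE_LEN],
    qhead: AtomicU8, // written only by the consumer (thread)
    qtail: AtomicU8, // written only by the producer (kernel)
}

impl RtQueue {
    fn new() -> Self {
        Self {
            q: [RtSig::default(); QUEUE_LEN],
            qhead: AtomicU8::new(0),
            qtail: AtomicU8::new(0),
        }
    }

    /// Producer side: write the entry first, then publish it via `qtail`.
    /// Returns false if the queue is full.
    fn enqueue(&mut self, sig: RtSig) -> bool {
        let tail = self.qtail.load(Ordering::Relaxed);
        let head = self.qhead.load(Ordering::Acquire);
        if tail.wrapping_sub(head) as usize == QUEUE_LEN {
            return false;
        }
        self.q[tail as usize % QUEUE_LEN] = sig;
        self.qtail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer side: read the oldest entry, then release the slot via `qhead`.
    fn dequeue(&mut self) -> Option<RtSig> {
        let head = self.qhead.load(Ordering::Relaxed);
        let tail = self.qtail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let sig = self.q[head as usize % QUEUE_LEN];
        self.qhead.store(head.wrapping_add(1), Ordering::Release);
        Some(sig)
    }
}
```

Because 256 is a multiple of 32, the wrapping `u8` counters map cleanly onto slot indices, and `qtail - qhead` gives the number of queued entries without reserving an empty slot.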
156 | 
157 | This iteration will likely be very quick, so long as the number of possible signals does not significantly increase.
158 | Should this become a performance issue, the queue array may be converted to a structure-of-arrays, where the signal numbers can be packed, and counted quickly using SIMD.
159 | POSIX only requires the implementation to provide 8 realtime signals, and the threading implementation requires two additional signals (cancellation and timer).
160 | 
161 | ## raise
162 | 
163 | Raise will initially be implemented using a regular thread-specific kill syscall, but that should be possible to bypass.
164 | 
165 | ## sigprocmask/pthread_sigmask
166 | 
167 | Changing the signal mask (equivalently, the inverted allowset) is done fully in userspace.
168 | The inhibit bit is set, and the allowsets for each group are atomically swapped while simultaneously reading the pending set. If there are pending unblocked signals, then at least one will be delivered before the `*mask` function returns.
169 | 
170 | Since the allowset is writable strictly by the target thread only, it can be modified without a CAS loop on architectures such as x86, which supports XADD (atomic fetch_add).
171 | Specifically, swapping only the allowset is done using `word.fetch_add(new_allowset.wrapping_sub(old_allowset))`.
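
A minimal sketch of this trick, assuming the layout given earlier (pending set in the low 32 bits, allowset in the high 32 bits). The function name `swap_allowset` is hypothetical, and SeqCst is used only because the RFC assumes it for now.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: replace the allowset (high half) of one group's
/// control word without a CAS loop, returning the pending set (low half)
/// observed at the moment of the swap.
///
/// The caller must pass the exact current allowset. This is possible because
/// only the owning thread ever writes the high half.
fn swap_allowset(word: &AtomicU64, old_allowset: u32, new_allowset: u32) -> u32 {
    let old_hi = u64::from(old_allowset) << 32;
    let new_hi = u64::from(new_allowset) << 32;
    // The delta has all-zero low 32 bits, so the addition cannot disturb
    // pending bits that another hardware thread sets concurrently, and the
    // high half becomes exactly `new_allowset` (modulo 2^64).
    let prev = word.fetch_add(new_hi.wrapping_sub(old_hi), Ordering::SeqCst);
    prev as u32
}
```

The returned pending set, ANDed with `new_allowset`, tells the caller whether any newly unblocked signal must be delivered before the mask function returns.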
172 | 
173 | ## sigaction
174 | 
175 | Sigaction will be implemented entirely in redox-rt.
176 | With signals temporarily disabled in the `sigaction` function itself, it can use regular mutex-based synchronization, including synchronization between `sigaction` and the signal trampoline running on other threads.
177 | Setting the action to SIG_IGN will modify the ignmask and clear the allowset bit for the respective signal.
178 | Setting it to SIG_DFL will either modify the signal-is-stop bits for SIGTSTP/SIGTTIN/SIGTTOU, or set it to a builtin default handler.
179 | 
180 | POSIX allows the sigaction to change between the generation and delivery of a signal, allowing sigaction to be weakly ordered and only synchronize against the signal trampoline (see [mem-orderings][mem-orderings]).
181 | 
182 | ## sigwait, sigsuspend, etc
183 | 
184 | POSIX does not appear to differentiate between accepting (i.e. `sigwait`ing) and delivering (i.e. trampoline runs).
185 | Thus, it should be valid to implement these internally with a few additional checks in the trampoline.
186 | That said, these functions can avoid the trampoline entirely, by using the inhibit bit and simply catching `EINTR`, which is what the current implementation does.
187 | 
188 | # Drawbacks
189 | [drawbacks]: #drawbacks
190 | 
191 | This obviously adds complexity, and partially blurs the line between userspace and kernel.
192 | The TCB will also significantly grow in size, and part of it also needs to store pthread information.
193 | However, from a microkernel perspective, it would be useful to move as much of the POSIX logic as possible to userspace.
194 | It would also likely improve the performance of signals, even compared to existing monolithic kernels such as Linux.
195 | 
196 | Since the sigaltstack logic is done in the trampolines, there's a large amount of assembly, but not significantly more than other libcs' signal trampolines on monolithic kernels.
197 | 
198 | # Alternatives
199 | [alternatives]: #alternatives
200 | 
201 | The kernel already has a basic signal implementation.
202 | It would be possible to extend this to include realtime signals, sigwait/sigsuspend, and implement all the sigaction flags.
203 | However, this will severely limit the ability for redox-rt to quickly protect critical sections in async-signal-safe functions.
204 | 
205 | Alternatively, it would be possible to only implement the logic behind the _inhibit_ bit.
206 | That said, this breaks down if larger critical sections are used, that may internally block.
207 | In those cases, sigprocmask would likely be used to disable all signals inside that section, which would suffer from the same base syscall latency (usually a few hundred cycles), and this needs to be called twice.
208 | 
209 | # Unresolved questions
210 | [unresolved]: #unresolved-questions
211 | 
212 | ## Who should send the signals?
213 | 
214 | In this proposal, the kernel will be responsible for sending the signals, and userspace will merely control sigprocmasking, and raising some signals on its own.
215 | However, if performance is not sufficiently significant for this to stay in the kernel, it might make sense for _kill(3)_ and/or _sigqueue(3)_ to be implemented as a (fast synchronous) IPC call to the process manager.
216 | A process manager has been suggested as a way for the kernel to only abstract _contexts_, and let that manager define _threads_, _processes_, _sessions_, and _process groups_.
217 | 
218 | ## `siginfo_t`
219 | 
220 | POSIX, at least if Redox is to support XSI, requires extra information to be obtainable for each signal.
221 | It is very likely possible to pass this information in an array with one entry per signal, using the pending bits as synchronization, but this may require realtime signals to be queued internally in the kernel.
222 | 
223 | ## Memory orderings
224 | 
225 | [mem-orderings]: #memory-orderings
226 | 
227 | Acquire+Release+Relaxed should be sufficient, but for now, for the sake of correctness, the implementation is assumed to use SeqCst.
228 | 
229 | # Acknowledgements
230 | 
231 | This RFC has been developed as part of the _Unix-style Signals_ project. The project is funded through [NGI Zero Core](https://nlnet.nl/core), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu) program. Learn more at the [NLnet project page](https://nlnet.nl/project/RedoxOS-Signals).
232 | 
233 | [<img src="https://nlnet.nl/logo/banner.png" alt="NLnet foundation logo" width="20%" />](https://nlnet.nl)
234 | [<img src="https://nlnet.nl/image/logos/NGI0_tag.svg" alt="NGI Zero Logo" width="20%" />](https://nlnet.nl/core)
235 | 


--------------------------------------------------------------------------------
/text/0009-namespace-scheme.md:
--------------------------------------------------------------------------------
  1 | - Feature Name: namespace-scheme
  2 | - Start Date: 2024-01-17
  3 | - RFC PR: (leave this empty)
  4 | - Redox Issue: (leave this empty)
  5 | 
  6 | # Summary
  7 | [summary]: #summary
  8 | 
  9 | As discussed between jeremy_soller and rw_van.
 10 | 
 11 | Adopting the new scheme naming format `/scheme/scheme_name` creates an ambiguity when mounting a scheme in the effective namespace. Mounting a scheme will change from `open(":scheme_name", O_CREAT)` to
 12 | 
 13 | ```
 14 | open("/scheme/namespace/scheme_name", O_CREAT | O_EXCL)
 15 | ```
 16 | 
 17 | (See [Unresolved questions](#unresolved-questions) for naming options and issues.)
 18 | 
 19 | # Motivation
 20 | [motivation]: #motivation
 21 | 
 22 | Schemes exist within a namespace. Currently, a scheme is referred to as `scheme_name:`. The namespace manager is `RootScheme` in the kernel, and it is addressed using a scheme name of `:`, which is effectively an empty name `""` followed by a colon `:` separator. Creating/mounting a scheme is done using the format `:scheme_name`, i.e. an empty name followed by a separator and the scheme name as a path. This allows path parsing to naturally detect references to the root scheme, and to pass the scheme name to the root scheme for mounting.
 23 | 
 24 | The change to a naming format of `/scheme/scheme_name` means that a request to the root scheme to mount `scheme_name` cannot be naturally parsed. There is no separator that indicates an empty scheme name.
 25 | 
 26 | # Detailed design
 27 | [design]: #detailed-design
 28 | 
 29 | ## Behavior
 30 | 
 31 | 1. Currently, the effective namespace is referred to using `":"`, i.e. an empty name `""` followed by a colon separator `":"`. The proposed new name for the effective namespace is `/scheme/namespace`. 
 32 | 
 33 | 2. Currently, mounting scheme `scheme_name` is done by `open(":scheme_name", O_CREAT)`. The new open call will be
 34 | 
 35 | ```
 36 | open("/scheme/namespace/scheme_name", O_CREAT | O_EXCL)
 37 | ```
 38 | 
 39 | This will result in the new scheme being mounted as `/scheme/scheme_name`.
 40 | 
 41 | 3. If the scheme has already been mounted and is healthy, a second create call will fail (`EEXIST`).
 42 | 
 43 | 4. `open("/scheme/namespace/scheme_name")` **without** `(O_CREAT | O_EXCL)` present will provide an fd that can be used to query or set **TBD** information about the namespace's view of the scheme. (See [Unresolved questions](#unresolved-questions).)
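
The mapping from the old root-scheme paths to the new ones can be sketched as a small pure function. The name `legacy_root_to_new` is hypothetical and is not part of the kernel or `redox-scheme`; the sketch only illustrates the naming rules in the list above and assumes the `/scheme/namespace` name recommended in this RFC.

```rust
/// Name of the effective namespace, as recommended in this RFC.
const NAMESPACE_PATH: &str = "/scheme/namespace";

/// Hypothetical helper: translate an old-format root-scheme path
/// (`":"` or `":scheme_name"`) into the proposed new format.
/// Returns `None` for paths that do not address the root scheme.
fn legacy_root_to_new(old: &str) -> Option<String> {
    // Old format: an empty scheme name followed by the `:` separator.
    let scheme_name = old.strip_prefix(':')?;
    if scheme_name.is_empty() {
        // `":"` referred to the effective namespace itself.
        Some(NAMESPACE_PATH.to_string())
    } else {
        // `":scheme_name"` was the path used to mount `scheme_name`.
        Some(format!("{NAMESPACE_PATH}/{scheme_name}"))
    }
}
```

For example, `legacy_root_to_new(":disk")` yields `/scheme/namespace/disk`, which a scheme provider would then open with `O_CREAT | O_EXCL` to mount itself as `/scheme/disk`.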
 44 | 
 45 | ## Changes
 46 | 
 47 | ### Kernel dispatch
 48 | 
 49 | 1. [kernel::syscall::fs::open](https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/syscall/fs.rs?ref_type=heads#L66) needs to translate the empty scheme name `""` to the new namespace scheme name `"namespace"`.
 50 | 
 51 | 2. The function [scheme::SchemeList::new_ns](https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/scheme/mod.rs?ref_type=heads#L169) inserts an empty string `""` into the list of schemes as a key, so that parsing a path with an empty scheme is naturally forwarded to the RootScheme. Changing this string to `"namespace"`, in combination with the changes to `open` above, will enable the new format.
 52 | 
 53 | ### redox-scheme
 54 | 
 55 | [redox_scheme::Socket::create_inner](https://gitlab.redox-os.org/redox-os/redox-scheme/-/blob/master/src/lib.rs?ref_type=heads#L129) should be updated as soon as possible to use the new format.
 56 | 
 57 | ### Scheme providers
 58 | 
 59 | `redox-scheme` is the best practice interface for schemes. Wherever possible, schemes should be updated to use redox-scheme rather than implementing the scheme protocol themselves.
 60 | 
 61 | For those schemes that cannot use `redox-scheme`, the `open` call to create the scheme will need to be modified.
 62 | 
 63 | ### Contain
 64 | 
 65 | Contain should already be using redox-scheme and should therefore update automatically when redox-scheme is updated. However, this should be verified.
 66 | 
 67 | # Drawbacks
 68 | [drawbacks]: #drawbacks
 69 | 
 70 | There is no compelling reason to not do this.
 71 | 
 72 | # Alternatives
 73 | [alternatives]: #alternatives
 74 | 
 75 | ## Status Quo
 76 | 
 77 | Due to the change in scheme naming described in the "Scheme Path" RFC, there is an ambiguity in referencing the root scheme. Does `/scheme/s1` mean "open the null path on scheme `s1`", or does it mean "open `s1` on the root scheme"? This is unnecessarily confusing and would require coding of `open` flags to resolve the ambiguity.
 78 | 
 79 | ## Special Files and Mount Points
 80 | 
 81 | Unix uses "special files", e.g. block-special and character-special files, to refer to physical devices and pseudo-devices. Special files are indicated by a filetype in their status flags, and the "major" and "minor" numbers are interpreted to determine the driver and specific resource (e.g. partition) for the special file. For filesystem providers, the special device can be "mounted" at a named point in the filesystem, e.g. the block-special device `/dev/nvme0n1p2` can be mounted at `/home`. This masks the file named `/home` and connects the filesystem to that point.
 82 | 
 83 | Conceivably, Redox could use a status bit to indicate a path that represents a "service provider" (scheme) that could be mounted, and then the "mount" system call could make the scheme accessible at some location in the filesystem. There would be a (mostly) one-to-one correspondence between what special files exist, and what schemes are in the namespace. This ultimately is very similar to the proposed mapping.
 84 | 
 85 | # Unresolved questions
 86 | [unresolved]: #unresolved-questions
 87 | 
 88 | 1. What should the name of the effective namespace scheme (RootScheme) be?
 89 | 
 90 |     - `/scheme/ns`
 91 |     - `/scheme/namespace` (see note below)
 92 |     - `/namespace` (treats namespace as a thing that is not a scheme)
 93 |     - `/scheme/ens` (`/scheme/rns` for "real namespace")
 94 | 
 95 | The recommended name in this RFC, `/scheme/namespace`, will be the choice once this RFC is approved.
 96 | 
 97 | 2. We need a naming format for distinct namespaces, e.g. `/scheme/namespaces/n` for namespace `n`. It should be separated from the naming of the effective namespace.
 98 | 
 99 |     - Having names for distinct namespaces would allow us to have a capability-based security mechanism where schemes can be inserted or removed from namespaces. It would also enable user-managed namespaces and schemes. Details to be discussed in another RFC.
100 |     
101 |     - Using a name that is distinct from the normal namespace, e.g. `/scheme/namespaces` (plural), would allow us to have arbitrary names for new namespaces. If we choose `/scheme/namespace/scheme_name` for mounting a scheme, and `/scheme/namespace/n` to refer to namespace `n`, then we are forced to use numbers rather than names for namespaces, and we will have schemes under the namespace folder at two different levels, creating confusion.
102 | 
103 | This is out of scope for this RFC.
104 | 
105 | 3. We need better management of namespaces, but it should be addressed in a separate RFC.
106 | 
107 | This is out of scope for this RFC.
108 | 
109 | 4. Should we change from `open` to `mount` when creating a new scheme? It includes additional flags and options.
110 | 
111 | This is out of scope for this RFC.
112 | 
113 | 5. Assume that when a scheme is restarted, it will call `open("/scheme/namespace/scheme_name", O_CREAT | O_EXCL)`. If those flags are not provided, then the open is to query the namespace about the state of the scheme. What actions can be performed on the resulting fd, and what information is provided?
114 | 
115 | This is out of scope for this RFC.
116 | 


--------------------------------------------------------------------------------