├── LICENSE
├── README
└── v3.10
    ├── 0001-net-tcp-TCP-with-Forward-Error-Correction-Common.patch
    ├── 0002-net-tcp-TCP-with-Forward-Error-Correction-Receiver.patch
    └── 0003-net-tcp-TCP-with-Forward-Error-Correction-Sender.patch


/LICENSE:
--------------------------------------------------------------------------------
                    GNU GENERAL PUBLIC LICENSE
                       Version 2, June 1991

 Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

  To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.

  We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.

  Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.

  Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.

  The precise terms and conditions for copying, distribution and
modification follow.

                    GNU GENERAL PUBLIC LICENSE
   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

  0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License.
The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.

  1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.

You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.

  2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

    a) You must cause the modified files to carry prominent notices
    stating that you changed the files and the date of any change.

    b) You must cause any work that you distribute or publish, that in
    whole or in part contains or is derived from the Program or any
    part thereof, to be licensed as a whole at no charge to all third
    parties under the terms of this License.

    c) If the modified program normally reads commands interactively
    when run, you must cause it, when started running for such
    interactive use in the most ordinary way, to print or display an
    announcement including an appropriate copyright notice and a
    notice that there is no warranty (or else, saying that you provide
    a warranty) and that users may redistribute the program under
    these conditions, and telling the user how to view a copy of this
    License. (Exception: if the Program itself is interactive but
    does not normally print such an announcement, your work based on
    the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.

In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

  3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

    a) Accompany it with the complete corresponding machine-readable
    source code, which must be distributed under the terms of Sections
    1 and 2 above on a medium customarily used for software interchange; or,

    b) Accompany it with a written offer, valid for at least three
    years, to give any third party, for a charge no more than your
    cost of physically performing source distribution, a complete
    machine-readable copy of the corresponding source code, to be
    distributed under the terms of Sections 1 and 2 above on a medium
    customarily used for software interchange; or,

    c) Accompany it with the information you received as to the offer
    to distribute corresponding source code. (This alternative is
    allowed only for noncommercial distribution and only if you
    received the program in object code or executable form with such
    an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable.
However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

  4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

  5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

  6.
Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

  7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices.
Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

  8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.

  9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.

  10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission.
For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.

                            NO WARRANTY

  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.

  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    {description}
    Copyright (C) {year} {fullname}

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) year name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:

  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
  `Gnomovision' (which makes passes at compilers) written by James Hacker.

  {signature of Ty Coon}, 1 April 1989
  Ty Coon, President of Vice

This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.

--------------------------------------------------------------------------------
/README:
--------------------------------------------------------------------------------
This repository hosts modifications to the Linux kernel to enable forward error
correction (FEC) in TCP. The technique is described as the "Corrective"
approach in our SIGCOMM 2013 publication titled "Reducing Web Latency: The
Virtue of Gentle Aggression".

The modifications were originally developed for the Linux kernel version 2.6.34
and have since been rebased to version 3.10 - though without extensive testing!

We invite you to play around with the patches and welcome your feedback. We are
also happy to apply bugfixes and future rebases you might have to the existing
patch set.

WARNING: Since the modifications have not been tested extensively in the current
kernel version, we advise you to run tests in an isolated environment that lets
you recover from kernel panics and similar failures.


### Patch components

The changes are grouped into three patches that build on top of each other:
Common, Receiver, and Sender. For a detailed description of the parts
implemented by each patch, please check the description at the top of each patch
file.


### Installation

To get started, fetch the Linux kernel version used as the base for the patch
set (here version 3.10):

    $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    $ cd linux
    $ git checkout tags/v3.10

Next, check out the patches (or download them directly). Then apply them in the
right order (Common, then Receiver, then Sender):

    $ git apply /0001-net-tcp-TCP-with-Forward-Error-Correction-Common.patch
    $ git apply /0002-net-tcp-TCP-with-Forward-Error-Correction-Receiver.patch
    $ git apply /0003-net-tcp-TCP-with-Forward-Error-Correction-Sender.patch

None of the three `apply` calls produces any console output if the patch applies
correctly.

NOTE: If you want to apply the patches to a different kernel version, keep in
mind that the apply step can fail if some changes conflict with the changed
code base. A possible solution is to apply a patch partially (using the
`--reject` flag when running `git apply`) and then manually resolve the
conflicts stored in the `.rej` files.

Finally, you need to compile and install the modified kernel. Many tutorials
cover this, so we do not replicate the instructions here.

The FEC feature is turned off by default in an environment running with a
modified kernel.
To enable it run:

    $ sysctl net.ipv4.tcp_fec=1


### Testing

We developed a set of packetdrill test routines to check the proper
functionality of the FEC engine. We will publish them in this repository soon.

If you have further questions, feel free to contact any of the contributors.

--------------------------------------------------------------------------------
/v3.10/0001-net-tcp-TCP-with-Forward-Error-Correction-Common.patch:
--------------------------------------------------------------------------------
From 1e117e7a24bddda7d53afe8da640ac4433f6450c Mon Sep 17 00:00:00 2001
From: Tobias Flach
Date: Mon, 25 Aug 2014 16:44:23 -0700
Subject: [PATCH] net-tcp: TCP with Forward Error Correction (Common)

Implementation of the common part of forward error correction for server and
client in TCP.

Implemented components:
* New sysctl to enable FEC (set to 1)
* Additional socket buffer pointer for a struct storing FEC control parameters
* FEC option encoding and decoding
* Negotiation during connection setup:
  - The client requests FEC usage by adding an FEC option to the SYN
    packet. That is, option kind EXP (0xFE), two magic bytes to
    identify the FEC option (0xDC60), and one extra byte to identify
    the encoding type
  - Currently supported encoding types are:
      TCP_FEC_TYPE_XOR_ALL     1  XORs every MSS length segment
      TCP_FEC_TYPE_XOR_SKIP_1  2  XORs every other MSS length segment
  - If the server supports FEC, it copies the option over to the
    SYN/ACK packet.
  - Following a successful negotiation every packet carries an FEC
    option. Regular data packets and regular acknowledgements
    carry a short FEC option with a 1-byte value encoding various
    flags. Encoded packets carry the option with a 4-byte value,
    encoding the flags and the encoding range.
    Acknowledgements
    after a failed recovery carry the option with a 4-byte value,
    encoding the flags and the loss range.
---
 include/linux/skbuff.h     |   4 +-
 include/linux/tcp.h        |  38 ++++++++++++++
 include/net/request_sock.h |   2 +
 include/net/tcp.h          |  10 ++++
 include/net/tcp_fec.h      |  53 +++++++++++++++++++
 include/uapi/linux/tcp.h   |   1 +
 net/ipv4/Makefile          |   2 +-
 net/ipv4/sysctl_net_ipv4.c |   9 ++++
 net/ipv4/tcp.c             |  10 ++++
 net/ipv4/tcp_fec.c         | 124 +++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c       |  29 +++++++++++
 net/ipv4/tcp_ipv4.c        |   3 ++
 net/ipv4/tcp_minisocks.c   |   7 +++
 net/ipv4/tcp_output.c      |  45 ++++++++++++++++
 net/ipv6/tcp_ipv6.c        |   2 +
 15 files changed, 337 insertions(+), 2 deletions(-)
 create mode 100644 include/net/tcp_fec.h
 create mode 100644 net/ipv4/tcp_fec.c

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dec1748..b652f1c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -418,8 +418,10 @@ struct sk_buff {
 	 * layer. Please put your private variables there. If you
 	 * want to keep them across layers you have to do a skb_clone()
 	 * first. This is owned by whoever has the skb queued ATM.
+	 *
+	 * Increased the CB to hold pointer to an FEC structure.
 	 */
-	char cb[48] __aligned(8);
+	char cb[56] __aligned(8);
 
 	unsigned long _skb_refdst;
 #ifdef CONFIG_XFRM
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5adbc33..b5f23d8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -77,6 +77,24 @@ struct tcp_sack_block {
 #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
 #define TCP_DSACK_SEEN (1 << 2) /*1 = DSACK was received from peer*/
 
+/* Flags transmitted in the first FEC option byte after magic bytes
+ * (except if option is used for negotiation) */
+#define TCP_FEC_RECOVERY_CWR		0x80 /* Recovery triggered CWR */
+#define TCP_FEC_RECOVERY_SUCCESSFUL	0x40 /* Local recovery done */
+#define TCP_FEC_RECOVERY_FAILED		0x20 /* Local recovery failed */
+#define TCP_FEC_ENCODED			0x10 /* Packet is FEC-encoded */
+
+struct tcp_fec {
+	u8 type;	/* Requested FEC type (negotiation only,
+			 * see net/tcp_fec.h for type defs) */
+	u32 enc_seq;	/* Sequence number of first encoded byte */
+	u32 enc_len;	/* Encoding length */
+	u32 lost_seq;	/* Sequence number of first lost byte */
+	u32 lost_len;	/* Loss length */
+	u8 flags;	/* See flag definitions above */
+	bool saw_fec;	/* FEC option was retrieved from packet */
+};
+
 struct tcp_options_received {
 /* PAWS/RTTM data */
 	long ts_recent_stamp;/* Time we stored ts_recent (for aging) */
@@ -93,12 +111,14 @@ struct tcp_options_received {
 	u8 num_sacks; /* Number of SACK blocks */
 	u16 user_mss; /* mss requested by user in ioctl */
 	u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
+	struct tcp_fec fec; /* FEC-related parameters */
 };
 
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 {
 	rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
 	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+
memset(&(rx_opt->fec), 0, sizeof(struct tcp_fec));
 }
 
 /* This is the max number of SACKS that we'll generate and process. It's safe
@@ -321,6 +341,24 @@ struct tcp_sock {
 	 * socket. Used to retransmit SYNACKs etc.
 	 */
 	struct request_sock *fastopen_rsk;
+
+/* TCP FEC parameters
+ *	type		- negotiated FEC type to be used
+ *	next_seq	- next sequence which was not FEC-encoded before
+ *	lost_len	- bytes after rcv_nxt considered lost
+ *	flags		- see TCP_FEC_* flag definitions above
+ *	bytes_rcv_queue	- number of bytes stored in queued SKBs
+ *	rcv_queue	- copies from the socket's receive queue kept for
+ *			  FEC recovery
+ */
+	struct {
+		u8 type;
+		u32 next_seq;
+		u32 lost_len;
+		u8 flags;
+		u32 bytes_rcv_queue;
+		struct sk_buff_head rcv_queue;
+	} fec;
 };
 
 enum tsq_flags {
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 59795e4..06705c2 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -62,6 +62,8 @@ struct request_sock {
 	struct sock *sk;
 	u32 secid;
 	u32 peer_secid;
+	u8 fec_type;	/* Encoding type (see
+			 * net/tcp_fec.h) */
 };
 
 static inline struct request_sock *reqsk_alloc(const struct request_sock_ops *ops)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5bba80f..9a949a9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -184,6 +184,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
  * experimental options.
See draft-ietf-tcpm-experimental-options-00.txt
  */
 #define TCPOPT_FASTOPEN_MAGIC	0xF989
+#define TCPOPT_FEC_MAGIC	0xDC60
 
 /*
  * TCP option lengths
@@ -199,6 +200,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_COOKIE_PAIR 3 /* Cookie pair header extension */
 #define TCPOLEN_COOKIE_MIN (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
 #define TCPOLEN_COOKIE_MAX (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
+#define TCPOLEN_EXP_FEC_BASE 4
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED 12
@@ -209,6 +211,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_PERBLOCK 8
 #define TCPOLEN_MD5SIG_ALIGNED 20
 #define TCPOLEN_MSS_ALIGNED 4
+#define TCPOLEN_EXP_FEC_NEGOTIATION_ALIGNED 8
 
 /* Flags in tp->nonagle */
 #define TCP_NAGLE_OFF 1 /* Nagle's algo is disabled */
@@ -240,6 +243,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
  * cookie/data not present. (For testing purpose!)
  */
 #define TFO_SERVER_ALWAYS 0x1000
+/* Maximum number of in-order bytes kept in the receiver's buffer for FEC
+ * recoveries. The sender will never send more than this in a single FEC
+ * packet.
*/
+#define FEC_RCV_QUEUE_LIMIT 16000
 
 extern struct inet_timewait_death_row tcp_death_row;
 
@@ -287,6 +294,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_fec;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -713,6 +721,7 @@ struct tcp_skb_cb {
 	__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
 	/* 1 byte hole */
 	__u32 ack_seq; /* Sequence number ACK'd */
+	struct tcp_fec *fec; /* FEC parameters */
 };
 
 #define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0]))
@@ -1093,6 +1102,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->ecn_ok = 0;
 	ireq->rmt_port = tcp_hdr(skb)->source;
 	ireq->loc_port = tcp_hdr(skb)->dest;
+	req->fec_type = rx_opt->fec.type;
 }
 
 /* Compute time elapsed between SYNACK and the ACK completing 3WHS */
diff --git a/include/net/tcp_fec.h b/include/net/tcp_fec.h
new file mode 100644
index 0000000..ba219d1
--- /dev/null
+++ b/include/net/tcp_fec.h
@@ -0,0 +1,53 @@
+#ifndef _TCP_FEC_H
+#define _TCP_FEC_H
+
+#include
+#include
+
+/* FEC-encoding types (8 bits, internal) */
+#define TCP_FEC_TYPE_NONE	0 /* FEC disabled */
+#define TCP_FEC_TYPE_XOR_ALL	1 /* XOR every MSS length segment */
+#define TCP_FEC_TYPE_XOR_SKIP_1	2 /* XOR every other MSS length
+				   * segment */
+
+#define TCP_FEC_NUM_TYPES 3
+
+/*
+ * Returns true if FEC is enabled for the socket
+ */
+static inline bool tcp_fec_is_enabled(const struct tcp_sock *tp)
+{
+	return unlikely(tp->fec.type > 0);
+}
+
+/*
+ * Returns true if the current packet
in the buffer is FEC-encoded 241 | + */ 242 | +static inline bool tcp_fec_is_encoded(const struct tcp_sock *tp) 243 | +{ 244 | + return unlikely((tp->rx_opt.fec.flags & TCP_FEC_ENCODED) && 245 | + (tp->rx_opt.fec.saw_fec)); 246 | +} 247 | + 248 | +/* 249 | + * Decodes FEC parameters and stores them in the FEC struct 250 | + * @seq - sequence number of the packet 251 | + * @ack_seq - ACKed sequence number 252 | + * @is_syn - true, if option was attached to a packet with a SYN flag 253 | + * @ptr - points to the first byte of the FEC option after kind, length, 254 | + * and possible magic bytes 255 | + * @len - option length (without kind, length, magic bytes) 256 | + */ 257 | +int tcp_fec_decode_option(struct tcp_fec *fec, u32 seq, u32 ack_seq, 258 | + bool is_syn, const unsigned char *ptr, 259 | + unsigned int len); 260 | + 261 | +/* 262 | + * Encodes FEC parameters to wire format 263 | + * Pointer points to the first byte of the FEC option after kind, length, 264 | + * and possible magic bytes (pointer will be moved to first unoccupied byte) 265 | + */ 266 | +int tcp_fec_encode_option(struct tcp_sock *tp, struct tcp_fec *fec, 267 | + __be32 **ptr); 268 | + 269 | +#endif 270 | diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h 271 | index 8d776eb..15e3aba 100644 272 | --- a/include/uapi/linux/tcp.h 273 | +++ b/include/uapi/linux/tcp.h 274 | @@ -111,6 +111,7 @@ enum { 275 | #define TCP_REPAIR_OPTIONS 22 276 | #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */ 277 | #define TCP_TIMESTAMP 24 278 | +#define TCP_FEC 25 /* Forward error correction */ 279 | 280 | struct tcp_repair_opt { 281 | __u32 opt_code; 282 | diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile 283 | index 089cb9f..7ec6035 100644 284 | --- a/net/ipv4/Makefile 285 | +++ b/net/ipv4/Makefile 286 | @@ -6,7 +6,7 @@ obj-y := route.o inetpeer.o protocol.o \ 287 | ip_input.o ip_fragment.o ip_forward.o ip_options.o \ 288 | ip_output.o ip_sockglue.o inet_hashtables.o \ 289 | 
inet_timewait_sock.o inet_connection_sock.o \ 290 | - tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \ 291 | + tcp.o tcp_fec.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \ 292 | tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \ 293 | datagram.o raw.o udp.o udplite.o \ 294 | arp.o icmp.o devinet.o af_inet.o igmp.o \ 295 | diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c 296 | index fa2f63f..42ea051 100644 297 | --- a/net/ipv4/sysctl_net_ipv4.c 298 | +++ b/net/ipv4/sysctl_net_ipv4.c 299 | @@ -771,6 +771,15 @@ static struct ctl_table ipv4_table[] = { 300 | .proc_handler = proc_dointvec_minmax, 301 | .extra1 = &one 302 | }, 303 | + { 304 | + .procname = "tcp_fec", 305 | + .data = &sysctl_tcp_fec, 306 | + .maxlen = sizeof(int), 307 | + .mode = 0644, 308 | + .proc_handler = proc_dointvec_minmax, 309 | + .extra1 = &zero, 310 | + .extra2 = &one, 311 | + }, 312 | { } 313 | }; 314 | 315 | diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c 316 | index ab450c0..a243f86 100644 317 | --- a/net/ipv4/tcp.c 318 | +++ b/net/ipv4/tcp.c 319 | @@ -276,6 +276,7 @@ 320 | #include 321 | #include 322 | #include 323 | +#include 324 | 325 | #include 326 | #include 327 | @@ -2624,6 +2625,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level, 328 | else 329 | tp->tsoffset = val - tcp_time_stamp; 330 | break; 331 | + case TCP_FEC: 332 | + if (sysctl_tcp_fec && val >= 0 && val < TCP_FEC_NUM_TYPES) 333 | + tp->fec.type = val; 334 | + else 335 | + err = -EINVAL; 336 | + break; 337 | default: 338 | err = -ENOPROTOOPT; 339 | break; 340 | @@ -2840,6 +2847,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level, 341 | case TCP_TIMESTAMP: 342 | val = tcp_time_stamp + tp->tsoffset; 343 | break; 344 | + case TCP_FEC: 345 | + val = tp->fec.type; 346 | + break; 347 | default: 348 | return -ENOPROTOOPT; 349 | } 350 | diff --git a/net/ipv4/tcp_fec.c b/net/ipv4/tcp_fec.c 351 | new file mode 100644 352 | index 0000000..97a48ce 353 | --- /dev/null 354 | +++
b/net/ipv4/tcp_fec.c 355 | @@ -0,0 +1,124 @@ 356 | +#include 357 | + 358 | +/* Decodes FEC parameters and stores them in the FEC struct 359 | + * @seq - sequence number of the packet 360 | + * @ack_seq - ACKed sequence number 361 | + * @is_syn - true, if option was attached to a packet with a SYN flag 362 | + * @ptr - points to the first byte of the FEC option after kind, length, 363 | + * and possible magic bytes 364 | + * @len - option length (without kind, length, magic bytes) 365 | + */ 366 | +int tcp_fec_decode_option(struct tcp_fec *fec, u32 seq, u32 ack_seq, 367 | + bool is_syn, const unsigned char *ptr, 368 | + unsigned int len) 369 | +{ 370 | + /* reset / initialize option values which should be evaluated 371 | + * with EVERY incoming packet 372 | + */ 373 | + fec->flags = 0; 374 | + fec->saw_fec = 1; 375 | + 376 | + if (len == 1) { 377 | + /* Short option */ 378 | + u8 val = *((u8 *) ptr); 379 | + if (is_syn) { 380 | + /* Negotiation */ 381 | + fec->type = val; 382 | + } else { 383 | + /* Regular packet */ 384 | + fec->flags = val; 385 | + } 386 | + 387 | + return 0; 388 | + } 389 | + 390 | + if (len == 4) { 391 | + /* Long option */ 392 | + u32 val = get_unaligned_be32(ptr); 393 | + fec->flags = val >> 24; 394 | + 395 | + if (fec->flags & TCP_FEC_ENCODED) { 396 | + fec->enc_seq = seq; 397 | + fec->enc_len = val & 0xFFFFFF; 398 | + } else if (fec->flags & TCP_FEC_RECOVERY_FAILED) { 399 | + fec->lost_seq = ack_seq; 400 | + fec->lost_len = val & 0xFFFFFF; 401 | + } else { 402 | + return -EINVAL; 403 | + } 404 | + 405 | + return 0; 406 | + } 407 | + 408 | + /* Invalid option length */ 409 | + return -EINVAL; 410 | +} 411 | + 412 | +/* Encodes FEC parameters to wire format 413 | + * @ptr - Encoded option is written to this memory location (and the pointer 414 | + * is advanced to the next unoccupied byte, 4-byte aligned) 415 | + * Returns the length of the encoded option (including alignment) 416 | + */ 417 | +int tcp_fec_encode_option(struct tcp_sock *tp, 
struct tcp_fec *fec, 418 | + __be32 **ptr) 419 | +{ 420 | + int len; 421 | + 422 | + fec->flags |= tp->fec.flags; 423 | + fec->lost_len = tp->fec.lost_len; 424 | + tp->fec.flags &= ~TCP_FEC_RECOVERY_CWR; 425 | + tp->fec.flags &= ~TCP_FEC_RECOVERY_FAILED; 426 | + 427 | + /* Encode fixed option part (option kind, length, and magic bytes) */ 428 | + if (fec->flags & (TCP_FEC_ENCODED | TCP_FEC_RECOVERY_FAILED)) 429 | + len = 4 + TCPOLEN_EXP_FEC_BASE; /* Long option */ 430 | + else 431 | + len = 1 + TCPOLEN_EXP_FEC_BASE; /* Short option */ 432 | + 433 | + **ptr = htonl((TCPOPT_EXP << 24) | (len << 16) | TCPOPT_FEC_MAGIC); 434 | + (*ptr)++; 435 | + 436 | + if ((fec->flags & TCP_FEC_ENCODED) && 437 | + (fec->flags & TCP_FEC_RECOVERY_FAILED)) { 438 | + /* TODO Special case: need to separate loss indication 439 | + * from encoding or make option 12 bytes long 440 | + * This can only happen if a node receives and sends FEC 441 | + * data 442 | + */ 443 | + fec->flags &= ~TCP_FEC_RECOVERY_FAILED; 444 | + } 445 | + 446 | + if (fec->flags & TCP_FEC_ENCODED) { 447 | + /* FEC-encoded packets carry: 448 | + * 449 | + */ 450 | + **ptr = htonl((fec->flags << 24) | 451 | + (fec->enc_len)); 452 | + (*ptr)++; 453 | + return 8; 454 | + } else if (fec->flags & TCP_FEC_RECOVERY_FAILED) { 455 | + /* Packets with failed recovery indication carry: 456 | + * 457 | + */ 458 | + **ptr = htonl((fec->flags << 24) | 459 | + (fec->lost_len)); 460 | + (*ptr)++; 461 | + return 8; 462 | + } else if (fec->type) { 463 | + /* Negotiation packets carry: */ 464 | + **ptr = htonl((fec->type << 24) | 465 | + (TCPOPT_NOP << 16) | 466 | + (TCPOPT_NOP << 8) | 467 | + TCPOPT_NOP); 468 | + (*ptr)++; 469 | + return 8; 470 | + } else { 471 | + /* All other packets carry: */ 472 | + **ptr = htonl((fec->flags << 24) | 473 | + (TCPOPT_NOP << 16) | 474 | + (TCPOPT_NOP << 8) | 475 | + TCPOPT_NOP); 476 | + (*ptr)++; 477 | + return 8; 478 | + } 479 | +} 480 | diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c 481 | 
index 9c62257..3260498 100644 482 | --- a/net/ipv4/tcp_input.c 483 | +++ b/net/ipv4/tcp_input.c 484 | @@ -70,6 +70,7 @@ 485 | #include 486 | #include 487 | #include 488 | +#include 489 | #include 490 | #include 491 | #include 492 | @@ -3564,6 +3565,20 @@ void tcp_parse_options(const struct sk_buff *skb, 493 | break; 494 | #endif 495 | case TCPOPT_EXP: 496 | + /* TCP FEC option shares code 254 using a 497 | + * 16 bit magic number. 498 | + */ 499 | + if (sysctl_tcp_fec && 500 | + get_unaligned_be16(ptr) == 501 | + TCPOPT_FEC_MAGIC) { 502 | + tcp_fec_decode_option(&(opt_rx->fec), 503 | + ntohl(th->seq), 504 | + ntohl(th->ack_seq), th->syn, 505 | + ptr + 2, 506 | + opsize - TCPOLEN_EXP_FEC_BASE); 507 | + break; 508 | + } 509 | + 510 | /* Fast Open option shares code 254 using a 511 | * 16 bits magic number. It's valid only in 512 | * SYN or SYN-ACK with an even size. 513 | @@ -5093,6 +5108,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb, 514 | */ 515 | 516 | tp->rx_opt.saw_tstamp = 0; 517 | + tp->rx_opt.fec.saw_fec = 0; 518 | 519 | /* pred_flags is 0xS?10 << 16 + snd_wnd 520 | * if header_prediction is to be made 521 | @@ -5463,6 +5479,15 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, 522 | if (tcp_is_sack(tp) && sysctl_tcp_fack) 523 | tcp_enable_fack(tp); 524 | 525 | + /* 526 | + * FEC negotiation 527 | + * Disable FEC if both ends do not agree on the FEC type used 528 | + */ 529 | + if (tp->fec.type != tp->rx_opt.fec.type) { 530 | + tp->fec.type = 0; 531 | + tp->rx_opt.fec.type = 0; 532 | + } 533 | + 534 | tcp_mtup_init(sk); 535 | tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); 536 | tcp_initialize_rcv_mss(sk); 537 | @@ -5740,6 +5765,10 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb, 538 | 539 | tcp_initialize_rcv_mss(sk); 540 | tcp_fast_path_on(tp); 541 | + 542 | + /* SYN requested FEC usage */ 543 | + if (tp->rx_opt.fec.type > 0) 544 | + tp->fec.type = tp->rx_opt.fec.type; 545 | } else { 546 | 
return 1; 547 | } 548 | diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c 549 | index 7999fc5..04e3bf0 100644 550 | --- a/net/ipv4/tcp_ipv4.c 551 | +++ b/net/ipv4/tcp_ipv4.c 552 | @@ -74,6 +74,7 @@ 553 | #include 554 | #include 555 | #include 556 | +#include 557 | #include 558 | 559 | #include 560 | @@ -213,6 +214,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) 561 | 562 | tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT; 563 | 564 | + memset(&(tp->rx_opt.fec), 0, sizeof(struct tcp_fec)); 565 | + 566 | /* Socket identity is still unknown (sport may be zero). 567 | * However we set state to SYN-SENT and not releasing socket 568 | * lock select source port, enter ourselves into the hash tables and 569 | diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c 570 | index 0f01788..1d0bf2f 100644 571 | --- a/net/ipv4/tcp_minisocks.c 572 | +++ b/net/ipv4/tcp_minisocks.c 573 | @@ -483,6 +483,13 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, 574 | newtp->fastopen_rsk = NULL; 575 | newtp->syn_data_acked = 0; 576 | 577 | + /* TCP FEC option */ 578 | + newtp->rx_opt.fec.type = sysctl_tcp_fec ? req->fec_type : 0; 579 | + newtp->fec.type = newtp->fec.flags = 0; 580 | + newtp->fec.next_seq = newtp->snd_nxt; 581 | + newtp->fec.bytes_rcv_queue = 0; 582 | + skb_queue_head_init(&newtp->fec.rcv_queue); 583 | + 584 | TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_PASSIVEOPENS); 585 | } 586 | return newsk; 587 | diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c 588 | index ec335fa..00daf84 100644 589 | --- a/net/ipv4/tcp_output.c 590 | +++ b/net/ipv4/tcp_output.c 591 | @@ -37,6 +37,7 @@ 592 | #define pr_fmt(fmt) "TCP: " fmt 593 | 594 | #include 595 | +#include 596 | 597 | #include 598 | #include 599 | @@ -65,6 +66,8 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS; 600 | /* By default, RFC2861 behavior. 
*/ 601 | int sysctl_tcp_slow_start_after_idle __read_mostly = 1; 602 | 603 | +int sysctl_tcp_fec __read_mostly; 604 | + 605 | static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, 606 | int push_one, gfp_t gfp); 607 | 608 | @@ -381,6 +384,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp) 609 | #define OPTION_MD5 (1 << 2) 610 | #define OPTION_WSCALE (1 << 3) 611 | #define OPTION_FAST_OPEN_COOKIE (1 << 8) 612 | +#define OPTION_FEC (1 << 9) 613 | 614 | struct tcp_out_options { 615 | u16 options; /* bit field of OPTION_* */ 616 | @@ -391,6 +395,7 @@ struct tcp_out_options { 617 | __u8 *hash_location; /* temporary pointer, overloaded */ 618 | __u32 tsval, tsecr; /* need to include OPTION_TS */ 619 | struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */ 620 | + struct tcp_fec fec; /* FEC parameters */ 621 | }; 622 | 623 | /* Write previously computed TCP options to the packet. 624 | @@ -490,6 +495,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp, 625 | } 626 | ptr += (foc->len + 3) >> 2; 627 | } 628 | + 629 | + if (unlikely(OPTION_FEC & options)) 630 | + tcp_fec_encode_option(tp, &(opts->fec), &ptr); 631 | } 632 | 633 | /* Compute TCP options for SYN packets. 
This is not the final 634 | @@ -553,6 +561,14 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, 635 | } 636 | } 637 | 638 | + /* Prepare for FEC negotiation if requested */ 639 | + if (unlikely(tcp_fec_is_enabled(tp)) && 640 | + remaining >= TCPOLEN_EXP_FEC_NEGOTIATION_ALIGNED) { 641 | + opts->options |= OPTION_FEC; 642 | + opts->fec.type = tp->fec.type; 643 | + remaining -= TCPOLEN_EXP_FEC_NEGOTIATION_ALIGNED; 644 | + } 645 | + 646 | return MAX_TCP_OPTION_SPACE - remaining; 647 | } 648 | 649 | @@ -614,6 +630,16 @@ static unsigned int tcp_synack_options(struct sock *sk, 650 | } 651 | } 652 | 653 | + /* Handle request for FEC support from other side 654 | + * (respond with same FEC option if FEC is locally supported) 655 | + */ 656 | + if (sysctl_tcp_fec && unlikely(req->fec_type) && 657 | + remaining >= TCPOLEN_EXP_FEC_NEGOTIATION_ALIGNED) { 658 | + opts->options |= OPTION_FEC; 659 | + opts->fec.type = req->fec_type; 660 | + remaining -= TCPOLEN_EXP_FEC_NEGOTIATION_ALIGNED; 661 | + } 662 | + 663 | return MAX_TCP_OPTION_SPACE - remaining; 664 | } 665 | 666 | @@ -657,6 +683,19 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb 667 | opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK; 668 | } 669 | 670 | + /* Prepare option if connection has FEC enabled */ 671 | + if (tcp_fec_is_enabled(tp)) { 672 | + opts->options |= OPTION_FEC; 673 | + if (tcb && tcb->fec) 674 | + opts->fec = *(tcb->fec); 675 | + 676 | + /* regardless of packet type we need 4 more bytes 677 | + * including alignment 678 | + */ 679 | + size += 4; 680 | + size += TCPOLEN_EXP_FEC_BASE; 681 | + } 682 | + 683 | return size; 684 | } 685 | 686 | @@ -2956,6 +2995,12 @@ int tcp_connect(struct sock *sk) 687 | */ 688 | tp->snd_nxt = tp->write_seq; 689 | tp->pushed_seq = tp->write_seq; 690 | + 691 | + /* Initialize FEC members */ 692 | + tp->fec.next_seq = tp->snd_nxt; 693 | + tp->fec.bytes_rcv_queue = 0; 694 | + skb_queue_head_init(&tp->fec.rcv_queue); 695
| + 696 | TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS); 697 | 698 | /* Timer for repeating the SYN until an answer. */ 699 | diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c 700 | index 0a17ed9..dc2d12a 100644 701 | --- a/net/ipv6/tcp_ipv6.c 702 | +++ b/net/ipv6/tcp_ipv6.c 703 | @@ -288,6 +288,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, 704 | 705 | tp->rx_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr); 706 | 707 | + memset(&(tp->rx_opt.fec), 0, sizeof(struct tcp_fec)); 708 | + 709 | inet->inet_dport = usin->sin6_port; 710 | 711 | tcp_set_state(sk, TCP_SYN_SENT); 712 | -- 713 | 2.1.0.rc2.206.gedb03e5 714 | 715 | -------------------------------------------------------------------------------- /v3.10/0002-net-tcp-TCP-with-Forward-Error-Correction-Receiver.patch: -------------------------------------------------------------------------------- 1 | From 26339b95439aa587066938965f033e68a0d6f6c5 Mon Sep 17 00:00:00 2001 2 | From: Tobias Flach 3 | Date: Mon, 25 Aug 2014 16:46:16 -0700 4 | Subject: [PATCH] net-tcp: TCP with Forward Error Correction (Receiver) 5 | 6 | Implementation of the receiver part of forward error correction in TCP. 7 | 8 | Implemented components: 9 | * Detection of an FEC packet: 10 | - FEC-encoded packets have the ENCODED flag set in the FEC flags 11 | byte. If a packet does not carry an FEC option at all, the 12 | packet is discarded. 13 | * Payload recovery (decoding): 14 | - The receiver keeps up to (currently) 16000 in-order bytes in the 15 | buffer for possible recoveries. Data is not duplicated; instead, 16 | extra references are kept for the in-order SKBs to avoid them 17 | being freed once they are consumed by higher layers. 18 | - Once an FEC packet is received, the receiver tries to recover 19 | any encoded data that has not been received yet by reversing the 20 | encoding steps.
21 | - If a byte block can be reconstructed, an SKB is allocated and 22 | a TCP header attached, before the new packet is forwarded to 23 | regular reception routines. 24 | * Acknowledgements: 25 | - On successful recovery, every outgoing packet has the FEC flag 26 | RECOVERY_SUCCESS set. On reception of this flag, the data 27 | sender reduces the congestion window (similarly to ECN) 28 | and sets the FEC flag RECOVERY_CWR in the next outgoing packet. 29 | The RECOVERY_SUCCESS flag is transmitted for every packet until 30 | RECOVERY_CWR has been received (similarly to sending ECE until 31 | CWR is received in ECN). 32 | - On failed recovery, an extra acknowledgement for rcv_nxt is 33 | generated. The FEC flag RECOVERY_FAIL is set and the remaining 34 | 24 bits in the option encode the number of bytes after ack_seq 35 | which are considered lost. The data sender marks this byte 36 | range as lost and can start retransmissions. 37 | --- 38 | include/net/tcp_fec.h | 27 ++ 39 | net/ipv4/tcp_fec.c | 732 +++++++++++++++++++++++++++++++++++++++++++++++ 40 | net/ipv4/tcp_input.c | 40 ++- 41 | net/ipv4/tcp_minisocks.c | 2 + 42 | 4 files changed, 798 insertions(+), 3 deletions(-) 43 | 44 | diff --git a/include/net/tcp_fec.h b/include/net/tcp_fec.h 45 | index ba219d1..1660e58 100644 46 | --- a/include/net/tcp_fec.h 47 | +++ b/include/net/tcp_fec.h 48 | @@ -50,4 +50,31 @@ int tcp_fec_decode_option(struct tcp_fec *fec, u32 seq, u32 ack_seq, 49 | int tcp_fec_encode_option(struct tcp_sock *tp, struct tcp_fec *fec, 50 | __be32 **ptr); 51 | 52 | +/* 53 | + * Processes the current packet in the buffer (treated as FEC packet) 54 | + */ 55 | +int tcp_fec_process(struct sock *sk, struct sk_buff *skb); 56 | + 57 | +/* 58 | + * Checks the received options for loss indicators and acts upon them. 59 | + * In particular, the function handles window reduction requests and processes 60 | + * tail loss indicators.
61 | + * Returns: 1, if the ACK contains a loss indicator; 0, otherwise 62 | + */ 63 | +int tcp_fec_check_ack(struct sock *sk, u32 ack_seq); 64 | + 65 | +/* 66 | + * Since data in the socket's receive queue can get consumed by other parties 67 | + * we need to keep extra references to these SKBs until they are no longer 68 | + * required for possible future recoveries. 69 | + * @skb - buffer which is moved to the receive queue 70 | + */ 71 | +int tcp_fec_update_queue(struct sock *sk, struct sk_buff *skb); 72 | + 73 | +/* 74 | + * Disables FEC for this connection (includes clearing references 75 | + * to buffers in receive queue) 76 | + */ 77 | +void tcp_fec_disable(struct sock *sk); 78 | + 79 | #endif 80 | diff --git a/net/ipv4/tcp_fec.c b/net/ipv4/tcp_fec.c 81 | index 97a48ce..3a8bd6d 100644 82 | --- a/net/ipv4/tcp_fec.c 83 | +++ b/net/ipv4/tcp_fec.c 84 | @@ -1,5 +1,30 @@ 85 | #include 86 | 87 | +/* Codes for incoming FEC packet processing */ 88 | +#define FEC_NO_LOSS 1 89 | +#define FEC_LOSS_UNRECOVERED 2 90 | +#define FEC_LOSS_RECOVERED 3 91 | + 92 | +/* Receiver routines */ 93 | +static int tcp_fec_process_xor(struct sock *sk, const struct sk_buff *skb, 94 | + unsigned int block_skip); 95 | +static int tcp_fec_recover(struct sock *sk, const struct sk_buff *skb, 96 | + unsigned char *data, u32 seq, int len); 97 | +static void tcp_fec_send_ack(struct sock *sk, const struct sk_buff *skb, 98 | + int recovery_status); 99 | +static void tcp_fec_reduce_window(struct sock *sk); 100 | +static void tcp_fec_mark_skbs_lost(struct sock *sk); 101 | +static bool tcp_fec_update_decoded_option(struct sk_buff *skb); 102 | +static struct sk_buff *tcp_fec_make_decoded_pkt(struct sock *sk, 103 | + const struct sk_buff *skb, unsigned char *dec_data, 104 | + u32 seq, unsigned int len); 105 | + 106 | +/* Buffer access routine */ 107 | +static unsigned int tcp_fec_get_next_block(struct sock *sk, 108 | + struct sk_buff **skb, struct sk_buff_head *queue, 109 | + u32 seq, unsigned int block_len, 110 | +
unsigned char *block); 111 | + 112 | /* Decodes FEC parameters and stores them in the FEC struct 113 | * @seq - sequence number of the packet 114 | * @ack_seq - ACKed sequence number 115 | @@ -122,3 +147,710 @@ int tcp_fec_encode_option(struct tcp_sock *tp, struct tcp_fec *fec, 116 | return 8; 117 | } 118 | } 119 | + 120 | +/* Processes the current packet in the buffer, treated as an FEC packet 121 | + * (assumes that options were already processed) 122 | + */ 123 | +int tcp_fec_process(struct sock *sk, struct sk_buff *skb) 124 | +{ 125 | + struct tcp_sock *tp; 126 | + struct tcphdr *th; 127 | + int recovery_status, err; 128 | + u32 end_seq; 129 | + 130 | + tp = tcp_sk(sk); 131 | + th = tcp_hdr(skb); 132 | + recovery_status = 0; 133 | + 134 | + /* drop packet if packet is not encoded */ 135 | + if (!(tp->rx_opt.fec.flags & TCP_FEC_ENCODED)) 136 | + return -1; 137 | + 138 | + /* check if all encoded packets were already received */ 139 | + end_seq = tp->rx_opt.fec.enc_seq + tp->rx_opt.fec.enc_len; 140 | + if (!after(end_seq, tp->rcv_nxt)) { 141 | + tcp_fec_send_ack(sk, skb, FEC_NO_LOSS); 142 | + return 0; 143 | + } 144 | + 145 | + /* linearize the SKB (for easier payload access) */ 146 | + err = skb_linearize(skb); 147 | + if (err) 148 | + return err; 149 | + 150 | + /* data recovery */ 151 | + switch (tp->fec.type) { 152 | + case TCP_FEC_TYPE_NONE: 153 | + return -1; 154 | + case TCP_FEC_TYPE_XOR_ALL: 155 | + recovery_status = tcp_fec_process_xor(sk, skb, 0); 156 | + break; 157 | + case TCP_FEC_TYPE_XOR_SKIP_1: 158 | + recovery_status = tcp_fec_process_xor(sk, skb, 1); 159 | + break; 160 | + } 161 | + 162 | + /* TODO error handling; -ENOMEM, etc. - disable FEC? 
*/ 163 | + if (recovery_status < 0) 164 | + return recovery_status; 165 | + 166 | + /* Send an explicit ACK if recovery failed */ 167 | + if (recovery_status == FEC_LOSS_UNRECOVERED) 168 | + tcp_fec_send_ack(sk, skb, recovery_status); 169 | + 170 | + return 0; 171 | +} 172 | + 173 | +/* Checks the received options for loss indicators and acts upon them. 174 | + * In particular, the function handles recovery flags (indicators for 175 | + * successful and failed recoveries, tail losses) 176 | + * Returns: 1, if ACK contains a loss indicator 177 | + */ 178 | +int tcp_fec_check_ack(struct sock *sk, u32 ack_seq) 179 | +{ 180 | + struct tcp_sock *tp; 181 | + 182 | + tp = tcp_sk(sk); 183 | + 184 | + /* Clear local recovery indication (and ECN CWR demand) 185 | + * if it was ACKED by the other node 186 | + */ 187 | + if (tp->rx_opt.fec.flags & TCP_FEC_RECOVERY_CWR) { 188 | + tp->fec.flags &= ~TCP_FEC_RECOVERY_SUCCESSFUL; 189 | + tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR; 190 | + } 191 | + 192 | + /* Check for tail loss indicators 193 | + * This happens when FEC was unable to recover the lost data and 194 | + * thus only sends an ACK with the loss range back. Everything not 195 | + * ACKed/SACKed now, is considered lost now. 
196 | + */ 197 | + if (tp->rx_opt.fec.flags & TCP_FEC_RECOVERY_FAILED) { 198 | + tcp_fec_mark_skbs_lost(sk); 199 | + return 1; 200 | + } 201 | + 202 | + /* Check if the remote endpoint successfully recovered data, 203 | + * if so we trigger a window reduction 204 | + */ 205 | + if (tp->rx_opt.fec.flags & TCP_FEC_RECOVERY_SUCCESSFUL) { 206 | + /* Ignore flag if window was already reduced for the current 207 | + * loss episode or if previous reduction was not signaled 208 | + * yet (no outgoing packets) 209 | + */ 210 | + if (after(ack_seq, tp->high_seq) && 211 | + !(tp->fec.flags & TCP_FEC_RECOVERY_CWR)) { 212 | + tcp_fec_reduce_window(sk); 213 | + tp->fec.flags |= TCP_FEC_RECOVERY_CWR; 214 | + } 215 | + 216 | + return 1; 217 | + } 218 | + 219 | + return 0; 220 | +} 221 | + 222 | +/* Since data in the socket's receive queue can get consumed by other parties 223 | + * we need to clone these SKBs until they are no longer required for possible 224 | + * future recoveries. This function is called after the TCP header has been 225 | + * removed from the SKB already. All parameters required for recovery are 226 | + * stored in the SKB's control buffer. 
227 | + * @skb - buffer which is moved to the receive queue 228 | + */ 229 | +int tcp_fec_update_queue(struct sock *sk, struct sk_buff *skb) 230 | +{ 231 | + struct tcp_sock *tp; 232 | + struct sk_buff *cskb; 233 | + u32 data_len; 234 | + int extra_bytes, err; 235 | + tp = tcp_sk(sk); 236 | + 237 | + /* clone the SKB and add it to the FEC receive queue 238 | + * (a simple extra reference to the SKB is not sufficient 239 | + * since SKBs can only be queued on one list at a time) 240 | + */ 241 | + cskb = skb_clone(skb, GFP_ATOMIC); 242 | + if (cskb == NULL) 243 | + return -ENOMEM; 244 | + 245 | + /* linearize the SKB (for easier payload access) */ 246 | + err = skb_linearize(cskb); 247 | + if (err) 248 | + return err; 249 | + 250 | + data_len = skb->len; 251 | + if (!data_len) { 252 | + kfree_skb(cskb); 253 | + return 0; 254 | + } 255 | + 256 | + skb_queue_tail(&tp->fec.rcv_queue, cskb); 257 | + tp->fec.bytes_rcv_queue += data_len; 258 | + 259 | + /* check if we can dereference old SKBs (as long as we have enough 260 | + * data for future recoveries) 261 | + */ 262 | + extra_bytes = tp->fec.bytes_rcv_queue - FEC_RCV_QUEUE_LIMIT; 263 | + while (extra_bytes > 0) { 264 | + cskb = skb_peek(&tp->fec.rcv_queue); 265 | + if (cskb == NULL) 266 | + return -EINVAL; 267 | + 268 | + data_len = TCP_SKB_CB(cskb)->end_seq - TCP_SKB_CB(cskb)->seq; 269 | + if (data_len > extra_bytes) { 270 | + break; 271 | + } else { 272 | + extra_bytes -= data_len; 273 | + tp->fec.bytes_rcv_queue -= data_len; 274 | + skb_unlink(cskb, &tp->fec.rcv_queue); 275 | + kfree_skb(cskb); 276 | + } 277 | + } 278 | + 279 | + return 0; 280 | +} 281 | + 282 | +/* Disables FEC for this connection (includes clearing references 283 | + * to buffers in receive queue) 284 | + */ 285 | +void tcp_fec_disable(struct sock *sk) 286 | +{ 287 | + struct tcp_sock *tp = tcp_sk(sk); 288 | + 289 | + if (!tcp_fec_is_enabled(tp)) 290 | + return; 291 | + 292 | + tp->fec.type = 0; 293 | + tp->fec.bytes_rcv_queue = 0; 294 | +
skb_queue_purge(&tp->fec.rcv_queue); 295 | +} 296 | + 297 | +/* Processes the current packet in the buffer, treated as an FEC packet 298 | + * with XOR-encoded payload (assumes that options were already processed) 299 | + * Returns: negative code, if an error occurred; 300 | + * positive code, otherwise (recovery status) 301 | + * @block_skip - Number of unencoded blocks between two encoded blocks 302 | + */ 303 | +static int tcp_fec_process_xor(struct sock *sk, const struct sk_buff *skb, 304 | + unsigned int block_skip) 305 | +{ 306 | + struct sk_buff *pskb; 307 | + struct tcp_sock *tp; 308 | + struct tcphdr *th; 309 | + u32 next_seq, end_seq, rec_seq; 310 | + unsigned char *data, *block; 311 | + unsigned int i, offset, data_len, block_len, rec_len; 312 | + bool seen_loss; 313 | + int ret; 314 | + 315 | + pskb = NULL; 316 | + tp = tcp_sk(sk); 317 | + th = tcp_hdr(skb); 318 | + next_seq = tp->rx_opt.fec.enc_seq; 319 | + end_seq = next_seq + tp->rx_opt.fec.enc_len; 320 | + block_len = skb->len - tcp_hdrlen(skb); 321 | + seen_loss = false; 322 | + offset = 0; 323 | + 324 | + /* memory allocation for decoding / recovered SKB data */ 325 | + data = kmalloc(2 * block_len, GFP_ATOMIC); 326 | + if (data == NULL) 327 | + return -ENOMEM; 328 | + 329 | + block = data + block_len; 330 | + 331 | + /* copy FEC payload (skip TCP header) */ 332 | + memcpy(data, skb->data + tcp_hdrlen(skb), block_len); 333 | + 334 | + /* process in-sequence data */ 335 | + while ((data_len = tcp_fec_get_next_block(sk, &pskb, 336 | + &tp->fec.rcv_queue, next_seq, 337 | + min(block_len, end_seq - next_seq), 338 | + block))) { 339 | + next_seq += data_len; 340 | + 341 | + /* XOR with existing payload */ 342 | + for (i = 0; i < data_len; i++) 343 | + data[i] ^= block[i]; 344 | + 345 | + /* we could not read a whole MSS block, which means we 346 | + * reached the end of the queue or end of range which the 347 | + * FEC packet covers 348 | + */ 349 | + if (data_len < block_len) 350 | + break; 351 | + 352
| + /* skip unencoded blocks if there is more data encoded */ 353 | + if (end_seq - next_seq > 0) 354 | + next_seq += block_len * block_skip; 355 | + } 356 | + 357 | + /* check if all encoded bytes were already received */ 358 | + if (next_seq == end_seq) { 359 | + kfree(data); 360 | + return FEC_NO_LOSS; 361 | + } 362 | + 363 | + /* we always recover one whole MSS block (otherwise slicing 364 | + * would introduce a lot of additional complexity here) and handle 365 | + * cut out already received sequences later 366 | + */ 367 | + rec_seq = next_seq; 368 | + rec_len = min(block_len, end_seq - rec_seq); 369 | + offset = data_len; 370 | + if ((rec_seq + rec_len) == end_seq) 371 | + goto recover; 372 | + 373 | + next_seq += block_len * (block_skip + 1); 374 | + pskb = NULL; 375 | + 376 | + /* read a possibly partial (smaller than MSS) block to fill up the 377 | + * previously unfilled block and achieve alignment again 378 | + */ 379 | + data_len = tcp_fec_get_next_block(sk, &pskb, &tp->out_of_order_queue, 380 | + next_seq, block_len - offset, block); 381 | + 382 | + next_seq += data_len; 383 | + 384 | + /* check if we could not read as much data as requested */ 385 | + if ((next_seq != end_seq) && (data_len < (block_len - offset))) 386 | + goto clean; 387 | + 388 | + /* XOR with existing payload */ 389 | + for (i = 0; i < data_len; i++) 390 | + data[i+offset] ^= block[i]; 391 | + 392 | + /* skip unencoded blocks if there is more data encoded */ 393 | + if (end_seq - next_seq > 0) 394 | + next_seq += block_len * block_skip; 395 | + 396 | + /* read all necessary blocks to finish decoding */ 397 | + while ((data_len = tcp_fec_get_next_block(sk, &pskb, 398 | + &tp->out_of_order_queue, next_seq, 399 | + min(block_len, end_seq - next_seq), 400 | + block))) { 401 | + next_seq += data_len; 402 | + 403 | + /* XOR with existing payload */ 404 | + for (i = 0; i < data_len; i++) 405 | + data[i] ^= block[i]; 406 | + 407 | + /* we could not read a whole MSS block, which means we 
reached 408 | + * the end of the queue or end of range which the FEC packet 409 | + * covers 410 | + */ 411 | + if (data_len < block_len) 412 | + break; 413 | + 414 | + /* skip unencoded blocks if there is more data encoded */ 415 | + if (end_seq - next_seq > 0) 416 | + next_seq += block_len * block_skip; 417 | + } 418 | + 419 | + /* check if additional losses were observed (cannot recover) */ 420 | + if (next_seq != end_seq) 421 | + goto clean; 422 | + 423 | +recover: 424 | + /* create and process recovered packets */ 425 | + for (i = 0; i < rec_len; i++) 426 | + block[i] = data[(offset + i) % block_len]; 427 | + 428 | + if (block_skip && ((block_len - offset) < rec_len)) { 429 | + /* recover non-consecutive sequence ranges (only when 430 | + * slicing is used) 431 | + */ 432 | + u32 second_seq; 433 | + unsigned int second_seq_len, first_seq_len; 434 | + 435 | + first_seq_len = block_len - offset; 436 | + second_seq = rec_seq + first_seq_len + block_len * block_skip; 437 | + second_seq_len = rec_len - first_seq_len; 438 | + 439 | + ret = tcp_fec_recover(sk, skb, block, rec_seq, first_seq_len); 440 | + if (ret >= 0) { 441 | + int second_ret = tcp_fec_recover(sk, skb, 442 | + block + first_seq_len, 443 | + second_seq, second_seq_len); 444 | + if (second_ret < 0 || !ret) 445 | + ret = second_ret; 446 | + } 447 | + } else { 448 | + ret = tcp_fec_recover(sk, skb, block, rec_seq, rec_len); 449 | + } 450 | + 451 | + kfree(data); 452 | + return ret ? 
ret : FEC_LOSS_RECOVERED; 453 | + 454 | +clean: 455 | + kfree(data); 456 | + return FEC_LOSS_UNRECOVERED; 457 | +} 458 | + 459 | +/* Create a recovered packet and forward it to the reception routine */ 460 | +static int tcp_fec_recover(struct sock *sk, const struct sk_buff *skb, 461 | + unsigned char *data, u32 seq, int len) 462 | +{ 463 | + struct sk_buff *rskb; 464 | + struct tcp_sock *tp; 465 | + 466 | + tp = tcp_sk(sk); 467 | + 468 | + /* We will notify the remote node that recovery was successful */ 469 | + tp->fec.flags |= TCP_FEC_RECOVERY_SUCCESSFUL; 470 | + 471 | + /* Check if we received some tail of the recovered sequence already 472 | + * by looking at the current SACK blocks (we don't want to recover 473 | + * more data than necessary to prevent DSACKs) 474 | + */ 475 | + if (tcp_is_sack(tp)) { 476 | + int i; 477 | + for (i = 0; i < tp->rx_opt.num_sacks; i++) { 478 | + if (before(tp->selective_acks[i].start_seq, 479 | + seq + len) && 480 | + !before(tp->selective_acks[i].end_seq, 481 | + seq + len)) { 482 | + len = tp->selective_acks[i].start_seq - seq; 483 | + break; 484 | + } 485 | + } 486 | + } 487 | + 488 | + /* We might have prematurely asked for a recovery in the case where the 489 | + * whole recovery sequence is already covered by SACKs 490 | + */ 491 | + if (len <= 0) 492 | + return FEC_NO_LOSS; 493 | + 494 | + /* Create decoded packet and forward to reception routine */ 495 | + rskb = tcp_fec_make_decoded_pkt(sk, skb, data, seq, len); 496 | + if (rskb == NULL) 497 | + return -EINVAL; 498 | + 499 | + return tcp_rcv_established(sk, rskb, tcp_hdr(rskb), rskb->len); 500 | +} 501 | + 502 | +/* Sends an ACK for the FEC packet and encodes any congestion 503 | + * and/or recovery information 504 | + */ 505 | +static void tcp_fec_send_ack(struct sock *sk, const struct sk_buff *skb, 506 | + int recovery_status) 507 | +{ 508 | + struct tcp_sock *tp; 509 | + u32 end_seq; 510 | + 511 | + tp = tcp_sk(sk); 512 | + 513 | + /* Right now we only need an
outgoing ACK if FEC recovery failed, 514 | + * in all other cases ACKs are implicitly generated 515 | + */ 516 | + switch (recovery_status) { 517 | + case FEC_LOSS_UNRECOVERED: 518 | + end_seq = tp->rx_opt.fec.enc_seq + tp->rx_opt.fec.enc_len; 519 | + tp->fec.flags |= TCP_FEC_RECOVERY_FAILED; 520 | + tp->fec.lost_len = end_seq - tp->rcv_nxt; 521 | + tcp_send_ack(sk); 522 | + break; 523 | + } 524 | +} 525 | + 526 | +/* Reduces the congestion window (similar to completed fast recovery) 527 | + * If the node is already in recovery mode, undo is disabled to enforce 528 | + * the window reduction upon completion 529 | + */ 530 | +static void tcp_fec_reduce_window(struct sock *sk) 531 | +{ 532 | + struct tcp_sock *tp; 533 | + const struct inet_connection_sock *icsk; 534 | + 535 | + tp = tcp_sk(sk); 536 | + icsk = inet_csk(sk); 537 | + 538 | + if (icsk->icsk_ca_state < TCP_CA_CWR) { 539 | + tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); 540 | + if (tp->snd_ssthresh < TCP_INFINITE_SSTHRESH) { 541 | + tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); 542 | + tp->snd_cwnd_stamp = tcp_time_stamp; 543 | + } 544 | + 545 | + /* Any future window reduction requests are ignored until 546 | + * snd_nxt is ACKed 547 | + */ 548 | + tp->high_seq = tp->snd_nxt; 549 | + tp->undo_marker = 0; 550 | + } else { 551 | + /* Socket is in some congestion mode and we only need to make 552 | + * sure that window reduction is executed when recovery 553 | + * is finished 554 | + */ 555 | + tp->undo_marker = 0; 556 | + } 557 | +} 558 | + 559 | +/* The incoming ACK indicates a failed recovery. 560 | + * Mark all unacked SKBs in the loss range as lost. 
561 | + * TODO With interleaved coding, we have the additional constraint 562 | + * that the SKBs in the loss range also must have been encoded by the 563 | + * triggering FEC packet, and for that we need to keep some info 564 | + * about FEC packets on the sender side 565 | + */ 566 | +static void tcp_fec_mark_skbs_lost(struct sock *sk) 567 | +{ 568 | + struct tcp_sock *tp; 569 | + struct sk_buff *skb; 570 | + u32 start_seq, end_seq; 571 | + 572 | + tp = tcp_sk(sk); 573 | + skb = tp->lost_skb_hint ? tp->lost_skb_hint : tcp_write_queue_head(sk); 574 | + 575 | + /* All SKBs falling completely in the range are marked */ 576 | + start_seq = tp->rx_opt.fec.lost_seq; 577 | + end_seq = tp->rx_opt.fec.lost_seq + tp->rx_opt.fec.lost_len; 578 | + 579 | + tcp_for_write_queue_from(skb, sk) { 580 | + if (skb == tcp_send_head(sk)) 581 | + break; 582 | + 583 | + /* Past loss range */ 584 | + if (!before(TCP_SKB_CB(skb)->seq, end_seq)) 585 | + break; 586 | + 587 | + /* SKB not (fully) within range */ 588 | + if (before(TCP_SKB_CB(skb)->seq, start_seq) || 589 | + after(TCP_SKB_CB(skb)->end_seq, end_seq)) 590 | + continue; 591 | + 592 | + /* SKB already marked */ 593 | + if (TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED)) 594 | + continue; 595 | + 596 | + /* Verify retransmit hint before marking 597 | + * (see tcp_verify_retransmit_hint(), 598 | + * copied since method defined static in tcp_input.c) 599 | + */ 600 | + if ((tp->retransmit_skb_hint == NULL) || 601 | + before(TCP_SKB_CB(skb)->seq, 602 | + TCP_SKB_CB(tp->retransmit_skb_hint)->seq)) 603 | + tp->retransmit_skb_hint = skb; 604 | + 605 | + if (!tp->lost_out || 606 | + after(TCP_SKB_CB(skb)->end_seq, tp->retransmit_high)) 607 | + tp->retransmit_high = TCP_SKB_CB(skb)->end_seq; 608 | + 609 | + /* Mark SKB as lost (see tcp_skb_mark_lost()) */ 610 | + tp->lost_out += tcp_skb_pcount(skb); 611 | + TCP_SKB_CB(skb)->sacked |= TCPCB_LOST; 612 | + } 613 | + 614 | + tcp_verify_left_out(tp); 615 | +} 616 | + 617 | +/* Searches
for the FEC option in the packet header and replaces 618 | + * the long option with a short one padded by NOPs. 619 | + * This is done to convert the option used by an encoded packet 620 | + * to the option used by a recovered packet. 621 | + */ 622 | +static bool tcp_fec_update_decoded_option(struct sk_buff *skb) 623 | +{ 624 | + struct tcphdr *th; 625 | + unsigned char *ptr; 626 | + int length; 627 | + 628 | + th = tcp_hdr(skb); 629 | + ptr = (unsigned char *) (th + 1); 630 | + length = (th->doff * 4) - sizeof(struct tcphdr); 631 | + 632 | + while (length > 0) { 633 | + int opcode = *ptr++; 634 | + int opsize; 635 | + 636 | + switch (opcode) { 637 | + case TCPOPT_EOL: 638 | + return 0; 639 | + case TCPOPT_NOP: 640 | + length--; 641 | + continue; 642 | + default: 643 | + opsize = *ptr++; 644 | + if (opsize < 2 || opsize > length) 645 | + return 0; 646 | + 647 | + if (opcode == TCPOPT_EXP && 648 | + get_unaligned_be16(ptr) == TCPOPT_FEC_MAGIC) { 649 | + /* Update FEC option: 650 | + * 1. Convert long option into short option 651 | + * 2. Clear ENCODED flag (keep other flags) 652 | + * 3. Replace option value (long option) by NOPs 653 | + */ 654 | + u32 *fec_opt_start = (u32 *) (ptr - 2); 655 | + *fec_opt_start = htonl(( 656 | + get_unaligned_be32(fec_opt_start) & 657 | + 0xFF00FFFF) | 0x00050000); 658 | + *(fec_opt_start + 1) = htonl(( 659 | + get_unaligned_be32(fec_opt_start + 1) & 660 | + 0xEF000000) | 0x00010101); 661 | + 662 | + return 1; 663 | + } 664 | + 665 | + ptr += opsize - 2; 666 | + length -= opsize; 667 | + } 668 | + } 669 | + 670 | + return 0; 671 | +} 672 | + 673 | +/* Allocates an SKB for data we want to forward to reception routines 674 | + * (recovered data) by making a copy of the FEC SKB and replacing the data 675 | + * part, all other segments (options, etc.) 
are preserved 676 | + */ 677 | +static struct sk_buff *tcp_fec_make_decoded_pkt(struct sock *sk, 678 | + const struct sk_buff *skb, 679 | + unsigned char *dec_data, 680 | + u32 seq, unsigned int len) 681 | +{ 682 | + struct tcp_sock *tp; 683 | + struct sk_buff *nskb; 684 | + 685 | + tp = tcp_sk(sk); 686 | + nskb = skb_copy(skb, GFP_ATOMIC); 687 | + if (nskb == NULL) 688 | + return NULL; 689 | + 690 | + /* Update FEC option for the new packet */ 691 | + if (!tcp_fec_update_decoded_option(nskb)) { 692 | + /* TODO Do we need this catch? Technically we don't reach this 693 | + * method if there is no FEC option in the header. 694 | + */ 695 | + return NULL; 696 | + } 697 | + 698 | + /* check if we received some tail of the recovered sequence already 699 | + * by looking at the current SACK blocks (we don't want to recover 700 | + * more data than necessary to prevent DSACKS) 701 | + */ 702 | + if (tcp_is_sack(tp)) { 703 | + int i; 704 | + for (i = 0; i < tp->rx_opt.num_sacks; i++) { 705 | + if (before(tp->selective_acks[i].start_seq, 706 | + seq + len) && 707 | + !before(tp->selective_acks[i].end_seq, 708 | + seq + len)) { 709 | + len = tp->selective_acks[i].start_seq - seq; 710 | + break; 711 | + } 712 | + } 713 | + } 714 | + 715 | + /* trim data section to fit recovered sequence if necessary */ 716 | + if (len < (TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq)) 717 | + skb_trim(nskb, len + tcp_hdrlen(nskb)); 718 | + 719 | + /* fix the sequence numbers */ 720 | + tcp_hdr(nskb)->seq = htonl(seq); 721 | + tcp_hdr(nskb)->ack_seq = htonl(tp->snd_una); 722 | + TCP_SKB_CB(nskb)->seq = seq; 723 | + TCP_SKB_CB(nskb)->end_seq = seq + len; 724 | + 725 | + /* replace SKB payload with recovered data */ 726 | + memcpy(nskb->data + tcp_hdrlen(nskb), dec_data, len); 727 | + 728 | + /* packets used for recovery had their checksums checked already */ 729 | + nskb->ip_summed = CHECKSUM_UNNECESSARY; 730 | + 731 | + return nskb; 732 | +} 733 | + 734 | +/* Gets the next byte block from 
an SKB queue (any SKB which is touched 735 | + * in this procedure will be linearized to simplify payload access) 736 | + * @skb - Points to SKB from which previous block was extracted (useful 737 | + * for successive calls to this function, which avoids moving through 738 | + * the whole queue again) 739 | + * @queue - SKB queue to read from (SKB has to point to an element on this 740 | + * queue) 741 | + * @seq - Sequence number of first byte in the block 742 | + * @block_len - Maximum number of bytes to copy into the block 743 | + * @block - Destination buffer (must hold at least block_len bytes) 744 | + * 745 | + * Returns the number of bytes written to the block memory (0 on failure) 746 | + */ 747 | +static unsigned int tcp_fec_get_next_block(struct sock *sk, 748 | + struct sk_buff **skb, 749 | + struct sk_buff_head *queue, u32 seq, 750 | + unsigned int block_len, unsigned char *block) 751 | +{ 752 | + unsigned int cur_len, offset, num_bytes; 753 | + int err; 754 | + u32 end_seq; 755 | + 756 | + cur_len = 0; 757 | + 758 | + /* Start at the head of the queue if the caller did not pass an 759 | + * SKB to resume from 760 | + */ 761 | + if (*skb == NULL) { 762 | + *skb = skb_peek(queue); 763 | + if (*skb == NULL) 764 | + return 0; 765 | + } 766 | + 767 | + /* move to SKB which stores the next sequence to encode */ 768 | + while (*skb) { 769 | + /* If we observe an RST/SYN, we stop here to avoid 770 | + * handling corner cases 771 | + */ 772 | + if (TCP_SKB_CB(*skb)->tcp_flags & 773 | + (TCPHDR_RST | 774 | + TCPHDR_SYN)) 775 | + return 0; 776 | + if (!before(seq, TCP_SKB_CB(*skb)->seq) && 777 | + before(seq, TCP_SKB_CB(*skb)->end_seq)) 778 | + break; 779 | + if (*skb == skb_peek_tail(queue)) { 780 | + *skb = NULL; 781 | + break; 782 | + } 783 | + 784 | + *skb = skb_queue_next(queue, *skb); 785 | + } 786 | + 787 | + if (*skb == NULL) 788 | + return 0; 789 | + 790 | + /* copy bytes from SKBs (connected sequences) */ 791 | + while (*skb && (cur_len < block_len)) { 792 | + err = skb_linearize(*skb); 793 | + if (err) 794 | + return 0; 795 | + 796 | + /* Deal with the end seq number being incremented by 797
| + * one if the FIN flag is set (we don't want to encode this) 798 | + */ 799 | + end_seq = TCP_SKB_CB(*skb)->end_seq; 800 | + if (TCP_SKB_CB(*skb)->tcp_flags & TCPHDR_FIN) 801 | + end_seq--; 802 | + 803 | + if ((seq >= TCP_SKB_CB(*skb)->seq) && (seq < end_seq)) { 804 | + /* Copy data depending on: 805 | + * - remaining space in the block 806 | + * - remaining data in the SKB 807 | + */ 808 | + offset = seq - TCP_SKB_CB(*skb)->seq; 809 | + num_bytes = min(block_len - cur_len, 810 | + end_seq - seq); 811 | + 812 | + memcpy(block + cur_len, (*skb)->data + offset, 813 | + num_bytes); 814 | + cur_len += num_bytes; 815 | + seq += num_bytes; 816 | + } 817 | + 818 | + if (*skb == skb_peek_tail(queue) || cur_len >= block_len) 819 | + break; 820 | + 821 | + *skb = skb_queue_next(queue, *skb); 822 | + } 823 | + 824 | + return cur_len; 825 | +} 826 | diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c 827 | index 3260498..4d17c5f 100644 828 | --- a/net/ipv4/tcp_input.c 829 | +++ b/net/ipv4/tcp_input.c 830 | @@ -107,6 +107,7 @@ int sysctl_tcp_early_retrans __read_mostly = 3; 831 | #define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */ 832 | #define FLAG_DATA_SACKED 0x20 /* New SACK. 
*/ 833 | #define FLAG_ECE 0x40 /* ECE in this ACK */ 834 | +#define FLAG_FEC_CWR_REQUESTED 0x80 /* cwnd reduction requested */ 835 | #define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/ 836 | #define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */ 837 | #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */ 838 | @@ -116,8 +117,9 @@ int sysctl_tcp_early_retrans __read_mostly = 3; 839 | 840 | #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) 841 | #define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) 842 | -#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE) 843 | +#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE|FLAG_FEC_CWR_REQUESTED) 844 | #define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED) 845 | +#define FLAG_CONGESTION (FLAG_ECE|FLAG_FEC_CWR_REQUESTED) 846 | 847 | #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH) 848 | #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH)) 849 | @@ -2536,7 +2538,8 @@ void tcp_enter_cwr(struct sock *sk, const int set_ssthresh) 850 | struct tcp_sock *tp = tcp_sk(sk); 851 | 852 | tp->prior_ssthresh = 0; 853 | - if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { 854 | + if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR && 855 | + after(tp->snd_una, tp->high_seq)) { 856 | tp->undo_marker = 0; 857 | tcp_init_cwnd_reduction(sk, set_ssthresh); 858 | tcp_set_ca_state(sk, TCP_CA_CWR); 859 | @@ -3195,7 +3198,7 @@ static inline bool tcp_ack_is_dubious(const struct sock *sk, const int flag) 860 | static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag) 861 | { 862 | const struct tcp_sock *tp = tcp_sk(sk); 863 | - return (!(flag & FLAG_ECE) || tp->snd_cwnd < tp->snd_ssthresh) && 864 | + return (!(flag & FLAG_CONGESTION) || tp->snd_cwnd < tp->snd_ssthresh) && 865 | !tcp_in_cwnd_reduction(sk); 866 | } 867 | 868 | @@ -3363,6 +3366,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) 869 | if 
(after(ack, prior_snd_una)) 870 | flag |= FLAG_SND_UNA_ADVANCED; 871 | 872 | + /* Check if FEC expects and executes a window reduction */ 873 | + if (tcp_fec_is_enabled(tp) && tcp_fec_check_ack(sk, ack)) 874 | + flag |= FLAG_FEC_CWR_REQUESTED; 875 | + 876 | prior_fackets = tp->fackets_out; 877 | prior_in_flight = tcp_packets_in_flight(tp); 878 | 879 | @@ -4059,6 +4066,9 @@ static void tcp_ofo_queue(struct sock *sk) 880 | tp->rcv_nxt, TCP_SKB_CB(skb)->seq, 881 | TCP_SKB_CB(skb)->end_seq); 882 | 883 | + if (tcp_fec_is_enabled(tp)) 884 | + tcp_fec_update_queue(sk, skb); 885 | + 886 | __skb_unlink(skb, &tp->out_of_order_queue); 887 | __skb_queue_tail(&sk->sk_receive_queue, skb); 888 | tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; 889 | @@ -4335,6 +4345,9 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) 890 | goto out_of_window; 891 | 892 | /* Ok. In sequence. In window. */ 893 | + if (tcp_fec_is_enabled(tp)) 894 | + tcp_fec_update_queue(sk, skb); 895 | + 896 | if (tp->ucopy.task == current && 897 | tp->copied_seq == tp->rcv_nxt && tp->ucopy.len && 898 | sock_owned_by_user(sk) && !tp->urg_data) { 899 | @@ -4653,6 +4666,12 @@ static int tcp_prune_queue(struct sock *sk) 900 | tp->copied_seq, tp->rcv_nxt); 901 | sk_mem_reclaim(sk); 902 | 903 | + /* Disable FEC if it was enabled to prevent keeping data 904 | + * in the receive queue longer than necessary 905 | + */ 906 | + if (tcp_fec_is_enabled(tp)) 907 | + tcp_fec_disable(sk); 908 | + 909 | if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf) 910 | return 0; 911 | 912 | @@ -5010,6 +5029,21 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, 913 | /* Reset is accepted even if it did not pass PAWS. 
*/ 914 | } 915 | 916 | + /* Special processing if FEC is enabled */ 917 | + if (tcp_fec_is_enabled(tp)) { 918 | + if (tcp_fec_is_encoded(tp)) { 919 | + tcp_fec_process(sk, skb); 920 | + goto discard; 921 | + } else if (!tp->rx_opt.fec.saw_fec && th->ack && 922 | + sk->sk_state == TCP_LAST_ACK) { 923 | + /* TODO Sometimes the FEC option is not appended to the 924 | + * FIN-ACK packet; socket options cleared? 925 | + */ 926 | + tcp_ack(sk, skb, FLAG_SLOWPATH); 927 | + goto discard; 928 | + } 929 | + } 930 | + 931 | /* Step 1: check sequence number */ 932 | if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) { 933 | /* RFC793, page 37: "In all states except SYN-SENT, all reset 934 | diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c 935 | index 1d0bf2f..acfc144 100644 936 | --- a/net/ipv4/tcp_minisocks.c 937 | +++ b/net/ipv4/tcp_minisocks.c 938 | @@ -483,6 +483,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, 939 | newtp->fastopen_rsk = NULL; 940 | newtp->syn_data_acked = 0; 941 | 942 | + newtp->high_seq = newtp->snd_nxt; 943 | + 944 | /* TCP FEC option */ 945 | newtp->rx_opt.fec.type = sysctl_tcp_fec ? req->fec_type : 0; 946 | newtp->fec.type = newtp->fec.flags = 0; 947 | -- 948 | 2.1.0.rc2.206.gedb03e5 949 | 950 | -------------------------------------------------------------------------------- /v3.10/0003-net-tcp-TCP-with-Forward-Error-Correction-Sender.patch: -------------------------------------------------------------------------------- 1 | From 928d69eec0343e39ecb3560e095b5c16d8d9977a Mon Sep 17 00:00:00 2001 2 | From: Tobias Flach 3 | Date: Mon, 25 Aug 2014 16:47:33 -0700 4 | Subject: [PATCH] net-tcp: TCP with Forward Error Correction (Sender) 5 | 6 | Implementation of the sender part of forward error correction in TCP. 
7 | 8 | Implemented components: 9 | * FEC payload construction and transmission (encoding): 10 | - The FEC mechanism is invoked 1/4 RTT after a transmission 11 | (can be a GSO/TSO packet). 12 | - The encoding scheme is negotiated during connection setup. In 13 | the case of the basic XOR, it XORs all byte blocks (ignoring packet 14 | boundaries, but using the current MSS as the block size) which 15 | were already transmitted but never FEC-encoded before. 16 | Depending on the specified maximum number of bytes per FEC payload 17 | (see FEC_RCV_QUEUE_LIMIT), it is possible that multiple FEC packets are 18 | generated in this step. 19 | - Currently, the FEC option carries the length of the sequence 20 | range used for encoding (that is, sequence number of the last 21 | encoded byte minus the sequence number of the first encoded byte). 22 | This is sufficient to determine the length of 23 | all encoded blocks on the receiver side (all blocks are MSS 24 | bytes long, except for the last one).
25 | * sysctl_tcp_fec extension to toggle FEC transmit during loss episodes: 26 | - Valid values are: 27 | 0 FEC is disabled 28 | 1 FEC is enabled except for loss episodes 29 | 2 FEC is enabled including for loss episodes 30 | --- 31 | include/net/inet_connection_sock.h | 4 +- 32 | include/net/tcp_fec.h | 26 +++ 33 | net/ipv4/inet_diag.c | 3 +- 34 | net/ipv4/sysctl_net_ipv4.c | 3 +- 35 | net/ipv4/tcp_fec.c | 396 +++++++++++++++++++++++++++++++++++++ 36 | net/ipv4/tcp_input.c | 6 + 37 | net/ipv4/tcp_ipv4.c | 3 +- 38 | net/ipv4/tcp_output.c | 5 +- 39 | net/ipv4/tcp_timer.c | 14 +- 40 | 9 files changed, 454 insertions(+), 6 deletions(-) 41 | 42 | diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h 43 | index de2c785..d13c597 100644 44 | --- a/include/net/inet_connection_sock.h 45 | +++ b/include/net/inet_connection_sock.h 46 | @@ -135,6 +135,7 @@ struct inet_connection_sock { 47 | #define ICSK_TIME_PROBE0 3 /* Zero window probe timer */ 48 | #define ICSK_TIME_EARLY_RETRANS 4 /* Early retransmit timer */ 49 | #define ICSK_TIME_LOSS_PROBE 5 /* Tail loss probe timer */ 50 | +#define ICSK_TIME_FEC 6 /* FEC delayed send timer */ 51 | 52 | static inline struct inet_connection_sock *inet_csk(const struct sock *sk) 53 | { 54 | @@ -225,7 +226,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what, 55 | } 56 | 57 | if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 || 58 | - what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE) { 59 | + what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE || 60 | + what == ICSK_TIME_FEC) { 61 | icsk->icsk_pending = what; 62 | icsk->icsk_timeout = jiffies + when; 63 | sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout); 64 | diff --git a/include/net/tcp_fec.h b/include/net/tcp_fec.h 65 | index 1660e58..38f2c40 100644 66 | --- a/include/net/tcp_fec.h 67 | +++ b/include/net/tcp_fec.h 68 | @@ -12,6 +12,9 @@ 69 | 70 | #define TCP_FEC_NUM_TYPES 3 
71 | 72 | +/* Delay transmission of FEC packets (delay defined in tcp_fec_arm_timer()) */ 73 | +#define TCP_FEC_DELAYED_SEND 1 74 | + 75 | /* 76 | * Returns true if FEC is enabled for the socket 77 | */ 78 | @@ -77,4 +80,27 @@ int tcp_fec_update_queue(struct sock *sk, struct sk_buff *skb); 79 | */ 80 | void tcp_fec_disable(struct sock *sk); 81 | 82 | +/* Arms the timer for a delayed FEC transmission if there is 83 | + * no earlier timeout defined (e.g. a retransmission timeout) 84 | + */ 85 | +void tcp_fec_arm_timer(struct sock *sk); 86 | + 87 | +/* The FEC timer fired. Force an FEC transmission for the 88 | + * last unencoded burst. Rearm the RTO timer (which was switched 89 | + * out when setting the FEC timer). Set a new FEC timer if there 90 | + * is pending unencoded data. 91 | + */ 92 | +void tcp_fec_timer(struct sock *sk); 93 | + 94 | +/* If FEC packet transmissions are delayed, set a timer 95 | + * (if not already set); otherwise invoke the FEC mechanism 96 | + * immediately 97 | + */ 98 | +int tcp_fec_invoke(struct sock *sk); 99 | + 100 | +/* Invokes the FEC mechanism set for the connection; 101 | + * creates and sends out FEC packets 102 | + */ 103 | +int tcp_fec_invoke_nodelay(struct sock *sk); 104 | + 105 | #endif 106 | diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c 107 | index 5f64875..e151283 100644 108 | --- a/net/ipv4/inet_diag.c 109 | +++ b/net/ipv4/inet_diag.c 110 | @@ -160,7 +160,8 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk, 111 | 112 | if (icsk->icsk_pending == ICSK_TIME_RETRANS || 113 | icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || 114 | - icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { 115 | + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE || 116 | + icsk->icsk_pending == ICSK_TIME_FEC) { 117 | r->idiag_timer = 1; 118 | r->idiag_retrans = icsk->icsk_retransmits; 119 | r->idiag_expires = EXPIRES_IN_MS(icsk->icsk_timeout); 120 | diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c 121 | index
42ea051..389900f 100644 122 | --- a/net/ipv4/sysctl_net_ipv4.c 123 | +++ b/net/ipv4/sysctl_net_ipv4.c 124 | @@ -28,6 +28,7 @@ 125 | 126 | static int zero; 127 | static int one = 1; 128 | +static int two = 2; 129 | static int four = 4; 130 | static int tcp_retr1_max = 255; 131 | static int ip_local_port_range_min[] = { 1, 1 }; 132 | @@ -778,7 +779,7 @@ static struct ctl_table ipv4_table[] = { 133 | .mode = 0644, 134 | .proc_handler = proc_dointvec, 135 | .extra1 = &zero, 136 | - .extra2 = &one, 137 | + .extra2 = &two, 138 | }, 139 | { } 140 | }; 141 | diff --git a/net/ipv4/tcp_fec.c b/net/ipv4/tcp_fec.c 142 | index 3a8bd6d..7f04e49 100644 143 | --- a/net/ipv4/tcp_fec.c 144 | +++ b/net/ipv4/tcp_fec.c 145 | @@ -19,12 +19,30 @@ static struct sk_buff *tcp_fec_make_decoded_pkt(struct sock *sk, 146 | const struct sk_buff *skb, unsigned char *dec_data, 147 | u32 seq, unsigned int len); 148 | 149 | +/* Sender routines */ 150 | +static int tcp_fec_create(struct sock *sk, struct sk_buff_head *list); 151 | +static int tcp_fec_create_xor(struct sock *sk, struct sk_buff_head *list, 152 | + unsigned int first_seq, unsigned int block_len, 153 | + unsigned int block_skip, 154 | + unsigned int max_encoded_per_pkt); 155 | +static struct sk_buff *tcp_fec_make_encoded_pkt(struct sock *sk, 156 | + struct tcp_fec *fec, unsigned char *enc_data, 157 | + u32 seq); 158 | +static int tcp_fec_xmit_all(struct sock *sk, struct sk_buff_head *list); 159 | +static int tcp_fec_xmit(struct sock *sk, struct sk_buff *skb); 160 | + 161 | /* Buffer access routine */ 162 | static unsigned int tcp_fec_get_next_block(struct sock *sk, 163 | struct sk_buff **skb, struct sk_buff_head *queue, 164 | u32 seq, unsigned int block_len, 165 | unsigned char *block); 166 | 167 | +/* Have to define this signature here since the actual function was static 168 | + * and tcp_output.c has no corresponding header file 169 | + */ 170 | +extern int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, 171 | + 
gfp_t gfp_mask); 172 | + 173 | /* Decodes FEC parameters and stores them in the FEC struct 174 | * @seq - sequence number of the packet 175 | * @ack_seq - ACKed sequence number 176 | @@ -854,3 +872,381 @@ static unsigned int tcp_fec_get_next_block(struct sock *sk, 177 | 178 | return cur_len; 179 | } 180 | + 181 | +/* Arms the timer for a delayed FEC transmission if there is 182 | + * no earlier timeout defined (i.e. retransmission timeout) 183 | + */ 184 | +void tcp_fec_arm_timer(struct sock *sk) 185 | +{ 186 | + struct inet_connection_sock *icsk; 187 | + struct tcp_sock *tp; 188 | + u32 delta, timeout, rtt; 189 | + 190 | + icsk = inet_csk(sk); 191 | + tp = tcp_sk(sk); 192 | + 193 | + /* Only arm a timer if connection is established */ 194 | + if (sk->sk_state != TCP_ESTABLISHED) 195 | + return; 196 | + 197 | + /* Forward next sequence to be encoded if unencoded data was acked */ 198 | + if (after(tp->snd_una, tp->fec.next_seq)) 199 | + tp->fec.next_seq = tp->snd_una; 200 | + 201 | + /* Don't arm the timer if there is no unencoded data left */ 202 | + if (!before(tp->fec.next_seq, tp->snd_nxt)) 203 | + return; 204 | + 205 | + /* TODO handle other timers which might be armed; 206 | + * EARLY_RETRANS? LOSS_PROBE? 207 | + */ 208 | + 209 | + /* Compute timeout (currently 0.25 * RTT) */ 210 | + rtt = tp->srtt >> 3; 211 | + timeout = rtt >> 2; 212 | + 213 | + /* Compute delay between transmission of original packet and this call 214 | + * (difference is subtracted from timeout value) 215 | + */ 216 | + delta = 0; 217 | + if (delta > timeout) { 218 | + tcp_fec_invoke_nodelay(sk); 219 | + return; 220 | + } else if (delta > 0) { 221 | + timeout -= delta; 222 | + } 223 | + 224 | + /* Do not replace a timeout occurring earlier */ 225 | + if (jiffies + timeout >= icsk->icsk_timeout) 226 | + return; 227 | + 228 | + inet_csk_reset_xmit_timer(sk, ICSK_TIME_FEC, timeout, TCP_RTO_MAX); 229 | +} 230 | + 231 | +/* The FEC timer fired. 
Force an FEC transmission for the 232 | + * last unencoded burst. Rearm the RTO timer (which was switched 233 | + * out when setting the FEC timer). Set a new FEC timer if there 234 | + * is pending unencoded data. 235 | + */ 236 | +void tcp_fec_timer(struct sock *sk) 237 | +{ 238 | + struct inet_connection_sock *icsk; 239 | + struct tcp_sock *tp; 240 | + 241 | + icsk = inet_csk(sk); 242 | + tp = tcp_sk(sk); 243 | + 244 | + tcp_fec_invoke_nodelay(sk); 245 | + 246 | + icsk->icsk_pending = 0; 247 | + tcp_rearm_rto(sk); 248 | + 249 | + tcp_fec_arm_timer(sk); 250 | +} 251 | + 252 | +/* If FEC packet transmissions are delayed set a timer 253 | + * (if not already set), otherwise invoke the FEC mechanism 254 | + * immediately 255 | + */ 256 | +int tcp_fec_invoke(struct sock *sk) 257 | +{ 258 | + struct inet_connection_sock *icsk; 259 | + struct tcp_sock *tp; 260 | + 261 | + icsk = inet_csk(sk); 262 | + tp = tcp_sk(sk); 263 | + 264 | +#ifndef TCP_FEC_DELAYED_SEND 265 | + return tcp_fec_invoke_nodelay(sk); 266 | +#else 267 | + /* Set the timer for sending an FEC packet if no FEC 268 | + * timer is active yet 269 | + */ 270 | + if (!icsk->icsk_pending || icsk->icsk_pending != ICSK_TIME_FEC) 271 | + tcp_fec_arm_timer(sk); 272 | +#endif 273 | + 274 | + return 0; 275 | +} 276 | + 277 | +/* Invokes the FEC mechanism set for the connection; 278 | + * Creates and sends out FEC packets 279 | + */ 280 | +int tcp_fec_invoke_nodelay(struct sock *sk) 281 | +{ 282 | + int err; 283 | + struct sk_buff_head *list; 284 | + struct sk_buff *skb; 285 | + struct tcp_fec *fec; 286 | + 287 | + list = kmalloc(sizeof(struct sk_buff_head), GFP_ATOMIC); 288 | + if (list == NULL) 289 | + return -ENOMEM; 290 | + 291 | + skb_queue_head_init(list); 292 | + err = tcp_fec_create(sk, list); 293 | + if (err) 294 | + goto clean; 295 | + 296 | + err = tcp_fec_xmit_all(sk, list); 297 | + if (err) 298 | + goto clean; 299 | + 300 | +clean: 301 | + /* Purge all SKBs (purge FEC structs first) */ 302 | + skb = 
(struct sk_buff *) list; 303 | + while (!skb_queue_is_last(list, skb)) { 304 | + skb = skb_queue_next(list, skb); 305 | + fec = TCP_SKB_CB(skb)->fec; 306 | + if (fec != NULL) { 307 | + kfree(fec); 308 | + TCP_SKB_CB(skb)->fec = NULL; 309 | + } 310 | + } 311 | + 312 | + skb_queue_purge(list); 313 | + kfree(list); 314 | + 315 | + /* TODO error handling; -ENOMEM, etc. - disable FEC? */ 316 | + 317 | + return err; 318 | +} 319 | + 320 | +/* Creates one or more FEC packets (can depend on the FEC type used) 321 | + * and puts them in a queue 322 | + * @list: queue head 323 | + */ 324 | +static int tcp_fec_create(struct sock *sk, struct sk_buff_head *list) 325 | +{ 326 | + struct tcp_sock *tp; 327 | + unsigned int first_seq, block_len; 328 | + int err; 329 | + 330 | + tp = tcp_sk(sk); 331 | + 332 | + /* Update the pointer to the first byte to be encoded next 333 | + * (this only matters when a packet was ACKed before it was 334 | + * encoded) 335 | + */ 336 | + if (after(tp->snd_una, tp->fec.next_seq)) 337 | + tp->fec.next_seq = tp->snd_una; 338 | + 339 | + first_seq = tp->fec.next_seq; 340 | + block_len = tcp_current_mss(sk); 341 | + 342 | + switch (tp->fec.type) { 343 | + case TCP_FEC_TYPE_NONE: 344 | + return 0; 345 | + case TCP_FEC_TYPE_XOR_ALL: 346 | + return tcp_fec_create_xor(sk, list, first_seq, 347 | + block_len, 0, 348 | + FEC_RCV_QUEUE_LIMIT - block_len); 349 | + case TCP_FEC_TYPE_XOR_SKIP_1: 350 | + err = tcp_fec_create_xor(sk, list, first_seq, block_len, 1, 351 | + FEC_RCV_QUEUE_LIMIT - block_len); 352 | + if (err) 353 | + return err; 354 | + 355 | + return tcp_fec_create_xor(sk, list, first_seq + block_len, 356 | + block_len, 1, 357 | + FEC_RCV_QUEUE_LIMIT - block_len); 358 | + } 359 | + 360 | + return 0; 361 | +} 362 | + 363 | +/* Creates FEC packet(s) using XOR encoding 364 | + * (allocates memory for the FEC structs) 365 | + * @first_seq - Sequence number of first byte to be encoded 366 | + * @block_len - Block length (typically MSS) 367 | + * @block_skip 
- Number of unencoded blocks between two encoded blocks 368 | + * @max_encoded_per_pkt - maximum number of blocks encoded per packet 369 | + * (0, if unlimited) 370 | + */ 371 | +static int tcp_fec_create_xor(struct sock *sk, struct sk_buff_head *list, 372 | + unsigned int first_seq, unsigned int block_len, 373 | + unsigned int block_skip, 374 | + unsigned int max_encoded_per_pkt) 375 | +{ 376 | + struct tcp_sock *tp; 377 | + struct sk_buff *skb, *fskb; 378 | + struct tcp_fec *fec; 379 | + unsigned int c_encoded; /* Number of currently encoded blocks 380 | + not yet added to an FEC packet */ 381 | + unsigned int next_seq; /* Next byte to encode */ 382 | + unsigned int i; 383 | + unsigned char *data, *block; 384 | + u16 data_len; 385 | + 386 | + tp = tcp_sk(sk); 387 | + skb = NULL; 388 | + c_encoded = 0; 389 | + next_seq = first_seq; 390 | + 391 | + /* memory allocation 392 | + * data - used temporarily to obtain byte blocks and store the payload 393 | + (is freed before returning; we need two blocks here to store 394 | + the previously XORed data that has not been added to an FEC 395 | + packet yet, and the new to-be XORed data extracted from one 396 | + or more existing buffers) 397 | + 398 | + * fec - used to store the FEC parameters 399 | + (is freed after the corresponding packet is forwarded to the 400 | + transmission routine) 401 | + */ 402 | + data = kmalloc(2 * block_len, GFP_ATOMIC); 403 | + if (data == NULL) 404 | + return -ENOMEM; 405 | + 406 | + fec = kmalloc(sizeof(struct tcp_fec), GFP_ATOMIC); 407 | + if (fec == NULL) { 408 | + kfree(data); 409 | + return -ENOMEM; 410 | + } 411 | + 412 | + memset(data, 0, 2 * block_len); 413 | + memset(fec, 0, sizeof(struct tcp_fec)); 414 | + 415 | + block = data + block_len; 416 | + 417 | + /* encode data blocks 418 | + * XXX atomicity check? 
419 | + */ 420 | + fec->enc_seq = next_seq; 421 | + while ((data_len = tcp_fec_get_next_block(sk, &skb, 422 | + &sk->sk_write_queue, next_seq, 423 | + min(block_len, tp->snd_nxt - next_seq), 424 | + block))) { 425 | + /* Check if we reached the encoding limit; then create packet 426 | + * with current payload and add it to the queue 427 | + */ 428 | + if (max_encoded_per_pkt > 0 && 429 | + c_encoded >= max_encoded_per_pkt) { 430 | + fskb = tcp_fec_make_encoded_pkt(sk, fec, data, 431 | + block_len); 432 | + if (fskb == NULL) { 433 | + kfree(data); 434 | + kfree(fec); 435 | + return -EINVAL; 436 | + } 437 | + 438 | + skb_queue_tail(list, fskb); 439 | + memset(data, 0, block_len); 440 | + c_encoded = 0; 441 | + 442 | + /* memory allocation for the FEC struct of the next 443 | + * packet 444 | + */ 445 | + fec = kmalloc(sizeof(struct tcp_fec), GFP_ATOMIC); 446 | + if (fec == NULL) { 447 | + kfree(data); 448 | + return -ENOMEM; 449 | + } 450 | + 451 | + memset(fec, 0, sizeof(struct tcp_fec)); 452 | + fec->enc_seq = next_seq; 453 | + } 454 | + 455 | + next_seq += data_len; 456 | + fec->enc_len = next_seq - fec->enc_seq; 457 | + 458 | + /* encode block into existing payload (XOR) */ 459 | + for (i = 0; i < data_len; i++) 460 | + data[i] ^= block[i]; 461 | + 462 | + c_encoded++; 463 | + 464 | + /* skip over blocks which are not requested for encoding */ 465 | + next_seq += block_len * block_skip; 466 | + } 467 | + 468 | + /* create final packet if some data was selected for encoding */ 469 | + if (c_encoded > 0) { 470 | + fskb = tcp_fec_make_encoded_pkt(sk, fec, data, block_len); 471 | + if (fskb == NULL) { 472 | + kfree(data); 473 | + kfree(fec); 474 | + return -EINVAL; 475 | + } 476 | + 477 | + skb_queue_tail(list, fskb); 478 | + } else { 479 | + kfree(fec); 480 | + } 481 | + 482 | + tp->fec.next_seq = next_seq; 483 | + kfree(data); 484 | + 485 | + return 0; 486 | +} 487 | + 488 | +/* Allocates an SKB for data we want to send and assigns 489 | + * the necessary options 
and fields 490 | + */ 491 | +static struct sk_buff *tcp_fec_make_encoded_pkt(struct sock *sk, 492 | + struct tcp_fec *fec, 493 | + unsigned char *enc_data, 494 | + unsigned int len) 495 | +{ 496 | + struct sk_buff *skb; 497 | + unsigned char *data; 498 | + 499 | + /* See tcp_make_synack(); 15 probably for tail pointer etc.? */ 500 | + len = min(len, fec->enc_len); 501 | + skb = alloc_skb(MAX_TCP_HEADER + 15 + len, GFP_ATOMIC); 502 | + if (skb == NULL) 503 | + return NULL; 504 | + 505 | + /* Reserve space for headers */ 506 | + skb_reserve(skb, MAX_TCP_HEADER); 507 | + 508 | + /* Specify sequence number and FEC struct address in control buffer */ 509 | + fec->flags |= TCP_FEC_ENCODED; 510 | + TCP_SKB_CB(skb)->seq = fec->enc_seq; 511 | + TCP_SKB_CB(skb)->fec = fec; 512 | + 513 | + /* Enable ACK flag (required for all data packets) */ 514 | + TCP_SKB_CB(skb)->tcp_flags = TCPHDR_ACK; 515 | + 516 | + /* Set GSO parameters */ 517 | + skb_shinfo(skb)->gso_segs = 1; 518 | + skb_shinfo(skb)->gso_size = 0; 519 | + skb_shinfo(skb)->gso_type = 0; 520 | + 521 | + /* Append payload to SKB */ 522 | + data = skb_put(skb, len); 523 | + memcpy(data, enc_data, len); 524 | + 525 | + skb->ip_summed = CHECKSUM_PARTIAL; 526 | + 527 | + return skb; 528 | +} 529 | + 530 | +/* Transmit all FEC packets in a list */ 531 | +static int tcp_fec_xmit_all(struct sock *sk, struct sk_buff_head *list) 532 | +{ 533 | + struct sk_buff *skb; 534 | + int err; 535 | + 536 | + if (list == NULL || skb_queue_empty(list)) 537 | + return 0; 538 | + 539 | + skb = (struct sk_buff *) list; 540 | + while (!skb_queue_is_last(list, skb)) { 541 | + skb = skb_queue_next(list, skb); 542 | + err = tcp_fec_xmit(sk, skb); 543 | + if (err) 544 | + return err; 545 | + } 546 | + 547 | + return 0; 548 | +} 549 | + 550 | +/* Transmits an FEC packet */ 551 | +static int tcp_fec_xmit(struct sock *sk, struct sk_buff *skb) 552 | +{ 553 | + /* TODO timers? 
no retransmissions, but want to deactivate FEC 554 | + * if we never get any FEC ACKs back 555 | + */ 556 | + return tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC); 557 | +} 558 | diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c 559 | index 4d17c5f..e1b70bf 100644 560 | --- a/net/ipv4/tcp_input.c 561 | +++ b/net/ipv4/tcp_input.c 562 | @@ -2938,6 +2938,12 @@ void tcp_rearm_rto(struct sock *sk) 563 | if (tp->fastopen_rsk) 564 | return; 565 | 566 | + /* Don't rearm the timer if an FEC timer is active. 567 | + * The FEC handler will rearm the timer once the event is handled. 568 | + */ 569 | + if (icsk->icsk_pending == ICSK_TIME_FEC) 570 | + return; 571 | + 572 | if (!tp->packets_out) { 573 | inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS); 574 | } else { 575 | diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c 576 | index 04e3bf0..d97fb04 100644 577 | --- a/net/ipv4/tcp_ipv4.c 578 | +++ b/net/ipv4/tcp_ipv4.c 579 | @@ -2667,7 +2667,8 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len) 580 | 581 | if (icsk->icsk_pending == ICSK_TIME_RETRANS || 582 | icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS || 583 | - icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { 584 | + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE || 585 | + icsk->icsk_pending == ICSK_TIME_FEC) { 586 | timer_active = 1; 587 | timer_expires = icsk->icsk_timeout; 588 | } else if (icsk->icsk_pending == ICSK_TIME_PROBE0) { 589 | diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c 590 | index 00daf84..cba34c0 100644 591 | --- a/net/ipv4/tcp_output.c 592 | +++ b/net/ipv4/tcp_output.c 593 | @@ -864,7 +864,7 @@ void tcp_wfree(struct sk_buff *skb) 594 | * We are working here with either a clone of the original 595 | * SKB, or a fresh unique copy made by the retransmit engine. 
596 | */ 597 | -static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, 598 | +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, 599 | gfp_t gfp_mask) 600 | { 601 | const struct inet_connection_sock *icsk = inet_csk(sk); 602 | @@ -1936,6 +1936,9 @@ repair: 603 | break; 604 | } 605 | 606 | + if (tcp_fec_is_enabled(tp)) 607 | + tcp_fec_invoke(sk); 608 | + 609 | if (likely(sent_pkts)) { 610 | if (tcp_in_cwnd_reduction(sk)) 611 | tp->prr_out += sent_pkts; 612 | diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c 613 | index 4b85e6f..b2808d4 100644 614 | --- a/net/ipv4/tcp_timer.c 615 | +++ b/net/ipv4/tcp_timer.c 616 | @@ -21,6 +21,7 @@ 617 | #include 618 | #include 619 | #include 620 | +#include 621 | 622 | int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES; 623 | int sysctl_tcp_synack_retries __read_mostly = TCP_SYNACK_RETRIES; 624 | @@ -472,7 +473,15 @@ out_reset_timer: 625 | if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0)) 626 | __sk_dst_reset(sk); 627 | 628 | -out:; 629 | +out: 630 | + /* FEC will switch out the RTO timer if a delayed FEC transmission 631 | + * should happen earlier than this. RTO timer will be switched in 632 | + * once the FEC timer fired. 633 | + * FEC transmissions during a loss episode require that the sysctl 634 | + * value is >= 2. 635 | + */ 636 | + if (tcp_fec_is_enabled(tp) && sysctl_tcp_fec >= 2) 637 | + tcp_fec_arm_timer(sk); 638 | } 639 | 640 | void tcp_write_timer_handler(struct sock *sk) 641 | @@ -497,6 +506,9 @@ void tcp_write_timer_handler(struct sock *sk) 642 | case ICSK_TIME_LOSS_PROBE: 643 | tcp_send_loss_probe(sk); 644 | break; 645 | + case ICSK_TIME_FEC: 646 | + tcp_fec_timer(sk); 647 | + break; 648 | case ICSK_TIME_RETRANS: 649 | icsk->icsk_pending = 0; 650 | tcp_retransmit_timer(sk); 651 | -- 652 | 2.1.0.rc2.206.gedb03e5 653 | 654 | --------------------------------------------------------------------------------
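The XOR encoding loop in tcp_fec_create_xor() above (`data[i] ^= block[i]`) can be illustrated outside the kernel with a minimal user-space sketch. The function names `fec_xor_into`, `fec_xor_encode` and `fec_xor_recover` are hypothetical and do not appear in the patch. The recovery step is an assumption about how a receiver could use such a parity payload; it is not taken from this patch. A short final block is treated as zero-padded, which matches XORing only `data_len` bytes into a zeroed buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* XOR `len` bytes of `block` into `parity` (parity[i] ^= block[i]). */
static void fec_xor_into(unsigned char *parity, const unsigned char *block,
			 size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		parity[i] ^= block[i];
}

/* Length of the block starting at `off`; the last block may be short. */
static size_t fec_block_size(size_t total_len, size_t off, size_t block_len)
{
	size_t left = total_len - off;

	return left < block_len ? left : block_len;
}

/* Build one parity block over buf[0..total_len), split into blocks of
 * `block_len` bytes. `parity` must hold `block_len` bytes.
 */
static void fec_xor_encode(const unsigned char *buf, size_t total_len,
			   size_t block_len, unsigned char *parity)
{
	size_t off;

	assert(block_len > 0);
	memset(parity, 0, block_len);

	for (off = 0; off < total_len; off += block_len)
		fec_xor_into(parity, buf + off,
			     fec_block_size(total_len, off, block_len));
}

/* Rebuild the single lost block with index `lost_idx` by XORing the
 * parity with every surviving block. `out` must hold `block_len` bytes.
 * (Hypothetical receiver-side step, shown only to motivate the encoding.)
 */
static void fec_xor_recover(const unsigned char *buf, size_t total_len,
			    size_t block_len, size_t lost_idx,
			    const unsigned char *parity, unsigned char *out)
{
	size_t off, idx;

	assert(block_len > 0);
	memcpy(out, parity, block_len);

	for (off = 0, idx = 0; off < total_len; off += block_len, idx++) {
		if (idx == lost_idx)
			continue;	/* contents of the lost block are unknown */
		fec_xor_into(out, buf + off,
			     fec_block_size(total_len, off, block_len));
	}
}
```

Because XOR is its own inverse, one parity block can repair at most one missing block in the range it covers. That is presumably why the patch also offers TCP_FEC_TYPE_XOR_SKIP_1, which interleaves two parity packets over alternating blocks.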