I wish to thank the members of the Boost mailing list, whose comments, compliments, and criticisms during both the development and formal review helped make the Tokenizer library what it is. I especially wish to thank Aleksey Gurtovoy for the idea of using a pair of iterators to specify the input, instead of a string. I also wish to thank Jeremy Siek for his idea of providing a container interface for the token iterators and for simplifying the template parameters for the TokenizerFunctions. He and Daryle Walker also emphasized the need to separate interface and implementation. Gary Powell sparked the idea of using isspace and ispunct as the defaults for char_delimiters_separator. Jeff Garland provided ideas on how to change the order of the template parameters in order to make tokenizer easier to declare. Thanks to Douglas Gregor, who served as review manager and provided many insights both on the Boost list and in e-mail on how to polish up the implementation and presentation of Tokenizer. Finally, thanks to Beman Dawes, who integrated the final version into the Boost distribution.
78 |Revised 85 | 25 86 | December, 2006
87 | 88 |Copyright © 2000 Jeremy Siek
89 | Copyright © 2001 John R. Bandela
Distributed under the Boost Software License, Version 1.0. (See 92 | accompanying file LICENSE_1_0.txt or 93 | copy at http://www.boost.org/LICENSE_1_0.txt)
95 | 96 | 97 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | * text=auto !eol svneol=native#text/plain 2 | *.gitattributes text svneol=native#text/plain 3 | 4 | # Scriptish formats 5 | *.bat text svneol=native#text/plain 6 | *.bsh text svneol=native#text/x-beanshell 7 | *.cgi text svneol=native#text/plain 8 | *.cmd text svneol=native#text/plain 9 | *.js text svneol=native#text/javascript 10 | *.php text svneol=native#text/x-php 11 | *.pl text svneol=native#text/x-perl 12 | *.pm text svneol=native#text/x-perl 13 | *.py text svneol=native#text/x-python 14 | *.sh eol=lf svneol=LF#text/x-sh 15 | configure eol=lf svneol=LF#text/x-sh 16 | 17 | # Image formats 18 | *.bmp binary svneol=unset#image/bmp 19 | *.gif binary svneol=unset#image/gif 20 | *.ico binary svneol=unset#image/ico 21 | *.jpeg binary svneol=unset#image/jpeg 22 | *.jpg binary svneol=unset#image/jpeg 23 | *.png binary svneol=unset#image/png 24 | *.tif binary svneol=unset#image/tiff 25 | *.tiff binary svneol=unset#image/tiff 26 | *.svg text svneol=native#image/svg%2Bxml 27 | 28 | # Data formats 29 | *.pdf binary svneol=unset#application/pdf 30 | *.avi binary svneol=unset#video/avi 31 | *.doc binary svneol=unset#application/msword 32 | *.dsp text svneol=crlf#text/plain 33 | *.dsw text svneol=crlf#text/plain 34 | *.eps binary svneol=unset#application/postscript 35 | *.gz binary svneol=unset#application/gzip 36 | *.mov binary svneol=unset#video/quicktime 37 | *.mp3 binary svneol=unset#audio/mpeg 38 | *.ppt binary svneol=unset#application/vnd.ms-powerpoint 39 | *.ps binary svneol=unset#application/postscript 40 | *.psd binary svneol=unset#application/photoshop 41 | *.rdf binary svneol=unset#text/rdf 42 | *.rss text svneol=unset#text/xml 43 | *.rtf binary svneol=unset#text/rtf 44 | *.sln text svneol=native#text/plain 45 | *.swf binary svneol=unset#application/x-shockwave-flash 46 
| *.tgz binary svneol=unset#application/gzip 47 | *.vcproj text svneol=native#text/xml 48 | *.vcxproj text svneol=native#text/xml 49 | *.vsprops text svneol=native#text/xml 50 | *.wav binary svneol=unset#audio/wav 51 | *.xls binary svneol=unset#application/vnd.ms-excel 52 | *.zip binary svneol=unset#application/zip 53 | 54 | # Text formats 55 | .htaccess text svneol=native#text/plain 56 | *.bbk text svneol=native#text/xml 57 | *.cmake text svneol=native#text/plain 58 | *.css text svneol=native#text/css 59 | *.dtd text svneol=native#text/xml 60 | *.htm text svneol=native#text/html 61 | *.html text svneol=native#text/html 62 | *.ini text svneol=native#text/plain 63 | *.log text svneol=native#text/plain 64 | *.mak text svneol=native#text/plain 65 | *.qbk text svneol=native#text/plain 66 | *.rst text svneol=native#text/plain 67 | *.sql text svneol=native#text/x-sql 68 | *.txt text svneol=native#text/plain 69 | *.xhtml text svneol=native#text/xhtml%2Bxml 70 | *.xml text svneol=native#text/xml 71 | *.xsd text svneol=native#text/xml 72 | *.xsl text svneol=native#text/xml 73 | *.xslt text svneol=native#text/xml 74 | *.xul text svneol=native#text/xul 75 | *.yml text svneol=native#text/plain 76 | boost-no-inspect text svneol=native#text/plain 77 | CHANGES text svneol=native#text/plain 78 | COPYING text svneol=native#text/plain 79 | INSTALL text svneol=native#text/plain 80 | Jamfile text svneol=native#text/plain 81 | Jamroot text svneol=native#text/plain 82 | Jamfile.v2 text svneol=native#text/plain 83 | Jamrules text svneol=native#text/plain 84 | Makefile* text svneol=native#text/plain 85 | README text svneol=native#text/plain 86 | TODO text svneol=native#text/plain 87 | 88 | # Code formats 89 | *.c text svneol=native#text/plain 90 | *.cpp text svneol=native#text/plain 91 | *.h text svneol=native#text/plain 92 | *.hpp text svneol=native#text/plain 93 | *.ipp text svneol=native#text/plain 94 | *.tpp text svneol=native#text/plain 95 | *.jam text svneol=native#text/plain 96 | 
*.java text svneol=native#text/plain 97 | -------------------------------------------------------------------------------- /include/boost/token_iterator.hpp: -------------------------------------------------------------------------------- 1 | // Boost token_iterator.hpp -------------------------------------------------// 2 | 3 | // Copyright John R. Bandela 2001 4 | // Distributed under the Boost Software License, Version 1.0. (See 5 | // accompanying file LICENSE_1_0.txt or copy at 6 | // http://www.boost.org/LICENSE_1_0.txt) 7 | 8 | // See http://www.boost.org/libs/tokenizer for documentation. 9 | 10 | // Revision History: 11 | // 16 Jul 2003 John Bandela 12 | // Allowed conversions from convertible base iterators 13 | // 03 Jul 2003 John Bandela 14 | // Converted to new iterator adapter 15 | 16 | 17 | 18 | #ifndef BOOST_TOKENIZER_POLICY_JRB070303_HPP_ 19 | #define BOOST_TOKENIZER_POLICY_JRB070303_HPP_ 20 | 21 | #include![]()
The Boost Tokenizer package provides a flexible and easy-to-use way to break a string or other character sequence into a series of tokens. Below is a simple example that will break up a phrase into words.
23 | 24 |
// simple_example_1.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is, a test";
   tokenizer<> tok(s);
   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

You can choose how the string gets parsed by using the TokenizerFunction. If you do not specify anything, the default TokenizerFunction is char_delimiters_separator<char>, which defaults to breaking up a string based on space and punctuation. Here is an example using another TokenizerFunction called escaped_list_separator. This TokenizerFunction parses a superset of comma-separated value (CSV) lines. The format looks like this:

Field 1,"putting quotes around fields, allows commas",Field 3

Below is an example that will break the previous line into its three fields.
56 | 57 |
// simple_example_2.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "Field 1,\"putting quotes around fields, allows commas\",Field 3";
   tokenizer<escaped_list_separator<char> > tok(s);
   for(tokenizer<escaped_list_separator<char> >::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

Finally, for some TokenizerFunctions you have to pass something into the constructor in order to do anything interesting. An example is the offset_separator. This class breaks a string into tokens based on offsets. For example, when 12252001 is parsed using offsets of 2,2,4 it becomes 12 25 2001. Below is the code used.
81 | 82 |
// simple_example_3.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "12252001";
   int offsets[] = {2,2,4};
   offset_separator f(offsets, offsets+3);
   tokenizer<offset_separator> tok(s,f);
   for(tokenizer<offset_separator>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

101 | 104 |
Revised 111 | 9 June 2010
112 | 113 |Copyright © 2001 John R. Bandela
114 | 115 |Distributed under the Boost Software License, Version 1.0. (See 116 | accompanying file LICENSE_1_0.txt or 117 | copy at http://www.boost.org/LICENSE_1_0.txt)
119 | 120 | 121 | -------------------------------------------------------------------------------- /doc/offset_separator.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |![]()
20 | class offset_separator 21 |22 | 23 |
The offset_separator class is an implementation of the TokenizerFunction concept that can be used with the tokenizer class to break text up into tokens. The offset_separator breaks a sequence of Char's into strings based on a sequence of offsets. For example, if you had the string "12252001" and offsets (2,2,4) it would break the string into 12 25 2001. Here is an example.
30 | 31 |
// simple_example_3.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "12252001";
   int offsets[] = {2,2,4};
   offset_separator f(offsets, offsets+3);
   tokenizer<offset_separator> tok(s,f);
   for(tokenizer<offset_separator>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

51 | 52 | 53 |
The offset_separator has one constructor of interest. (The default constructor is just there to make some compilers happy.) The declaration is below.

template<typename Iter>
offset_separator(Iter begin, Iter end, bool bwrapoffsets = true, bool breturnpartiallast = true)

| Parameter | Description |
|---|---|
| begin, end | Specify the sequence of integer offsets. |
| bwrapoffsets | Tells whether to wrap around to the beginning of the offsets when all the offsets have been used. For example, the string "1225200101012002" with offsets (2,2,4) and bwrapoffsets set to true would parse to 12 25 2001 01 01 2002. With bwrapoffsets set to false, it would parse to 12 25 2001 and then stop because all the offsets have been used. |
| breturnpartiallast | Tells whether, when the parsed sequence terminates before yielding the number of characters in the current offset, to create a token with what was parsed or to ignore it. For example, the string "122501" with offsets (2,2,4) and breturnpartiallast set to true will parse to 12 25 01. With it set to false, it will parse to 12 25 and then stop because there are only 2 characters left in the sequence instead of the 4 that should have been there. |

To use this class, pass an object of it anywhere a TokenizerFunction is required. If you default-construct the object, it will just return every character in the parsed sequence as a token (i.e., it defaults to an offset of 1, and bwrapoffsets is true).
108 | 109 |110 | 111 |
Revised 121 | 25 122 | December, 2006
123 | 124 |Copyright © 2001 John R. Bandela
125 | 126 |Distributed under the Boost Software License, Version 1.0. (See 127 | accompanying file LICENSE_1_0.txt or 128 | copy at http://www.boost.org/LICENSE_1_0.txt)
130 | 131 | 132 | -------------------------------------------------------------------------------- /doc/token_iterator.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |![]()
template <
    class TokenizerFunc = char_delimiters_separator<char>,
    class Iterator = std::string::const_iterator,
    class Type = std::string
>
class token_iterator_generator

template<class Type, class Iterator, class TokenizerFunc>
typename token_iterator_generator<TokenizerFunc,Iterator,Type>::type
make_token_iterator(Iterator begin, Iterator end,const TokenizerFunc& fun)

The token iterator serves to provide an iterator view of the tokens in a parsed sequence.
36 | 37 |
// simple_example_5.cpp
#include<iostream>
#include<boost/token_iterator.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "12252001";
   int offsets[] = {2,2,4};
   offset_separator f(offsets, offsets+3);
   typedef token_iterator_generator<offset_separator>::type Iter;
   Iter beg = make_token_iterator<string>(s.begin(),s.end(),f);
   Iter end = make_token_iterator<string>(s.end(),s.end(),f);
   // The above statement could also have been what is below
   // Iter end;
   for(;beg!=end;++beg){
       cout << *beg << "\n";
   }
}

61 | 62 | 63 |
| Parameter | Description |
|---|---|
| TokenizerFunc | The TokenizerFunction used to parse the sequence. |
| Iterator | The type of the iterator that specifies the sequence. |
| Type | The type of the token, typically string. |

The category of Iterator, up to and including Forward Iterator. Anything higher will get scaled down to Forward Iterator.

| Type | Remarks |
|---|---|
| token_iterator_generator::type | The type of the token iterator. |

template<class Type, class Iterator, class TokenizerFunc>
typename token_iterator_generator<TokenizerFunc,Iterator,Type>::type
make_token_iterator(Iterator begin, Iterator end,const TokenizerFunc& fun)

| Parameter | Description |
|---|---|
| begin | The beginning of the sequence to be parsed. |
| end | Past the end of the sequence to be parsed. |
| fun | A functor that is a model of TokenizerFunction. |
154 |
Revised 161 | 25 162 | December, 2006
163 | 164 |Copyright © 2001 John R. Bandela
165 | 166 |Distributed under the Boost Software License, Version 1.0. (See 167 | accompanying file LICENSE_1_0.txt or 168 | copy at http://www.boost.org/LICENSE_1_0.txt)
170 | 171 | 172 | -------------------------------------------------------------------------------- /doc/tokenizerfunction.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
A TokenizerFunction is a functor whose purpose is to parse a given sequence until exactly 1 token has been found or the end is reached. It then updates the token, and informs the caller of the location in the sequence of the next element immediately after the last element of the sequence that was parsed for the current token.
24 | 25 |Assignable, CopyConstructable

| X | A type that is a model of TokenizerFunction |
| func | Object of type X |
| tok | Object of Token |
| next | Iterator that points to the first unparsed element of the sequence being parsed |
| end | Iterator that points past the end of the sequence being parsed |

A token is the result of parsing a sequence.

In addition to the expressions in Assignable and CopyConstructable, the following expressions are valid:

| Name | Expression | Return type |
|---|---|---|
| Functor | func(next, end, tok) | bool |
| reset | reset() | void |

In addition to the expression semantics in Assignable and CopyConstructable, TokenizerFunction has the following expression semantics:

| Name | Expression | Precondition | Semantics | Postcondition |
|---|---|---|---|---|
| operator() | func(next, end, tok) | next and end are valid iterators to the same sequence. next is a reference the function is free to modify. tok is constructed. | The return value indicates whether a new token was found in the sequence [next,end). | If the return value is true, the new token is assigned to tok. next is always updated to the position where parsing should start on the subsequent call. |
| reset | reset() | None | Clears out all state variables that are used by the object in parsing the current sequence. | A new sequence to parse can be given. |

No guarantees. Models of TokenizerFunction are free to define their own complexity.
154 | 155 |165 |
Revised 172 | 25 173 | December, 2006
174 | 175 |Copyright © 2001 John R. Bandela
176 | 177 |Distributed under the Boost Software License, Version 1.0. (See 178 | accompanying file LICENSE_1_0.txt or 179 | copy at http://www.boost.org/LICENSE_1_0.txt)
181 | 182 | 183 | -------------------------------------------------------------------------------- /test/examples.cpp: -------------------------------------------------------------------------------- 1 | // Boost tokenizer examples -------------------------------------------------// 2 | 3 | // (c) Copyright John R. Bandela 2001. 4 | 5 | // Distributed under the Boost Software License, Version 1.0. (See 6 | // accompanying file LICENSE_1_0.txt or copy at 7 | // http://www.boost.org/LICENSE_1_0.txt) 8 | 9 | // See http://www.boost.org for updates, documentation, and revision history. 10 | 11 | #include![]()
21 | template <class Char, class Traits = std::char_traits<Char> >
22 | class char_delimiters_separator{
23 |
24 |
The char_delimiters_separator class is an implementation of the TokenizerFunction concept that can be used to break text up into tokens. It is the default TokenizerFunction for tokenizer and token_iterator_generator. An example is below.
29 | 30 |
// simple_example_4.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is, a test";
   tokenizer<char_delimiters_separator<char> > tok(s);
   for(tokenizer<char_delimiters_separator<char> >::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

There is one constructor of interest. It is as follows:

explicit char_delimiters_separator(bool return_delims = false,
    const Char* returnable = 0, const Char* nonreturnable = 0)

| Parameter | Description |
|---|---|
| return_delims | Whether or not to return the delimiters that have been found. Note that not all delimiters can be returned. See the other two parameters for explanation. |
| returnable | This specifies the returnable delimiters. These are the delimiters that can be returned as tokens when return_delims is true. Since these are typically punctuation, if a 0 is provided as the argument, then the returnable delimiters will be all characters C for which std::ispunct(C) yields a true value. If an argument of "" is provided, then this is taken to mean that there are no returnable delimiters. |
| nonreturnable | This specifies the nonreturnable delimiters. These are delimiters that cannot be returned as tokens. Since these are typically whitespace, if 0 is specified as an argument, then the nonreturnable delimiters will be all characters C for which std::isspace(C) yields a true value. If an argument of "" is provided, then this is taken to mean that there are no nonreturnable delimiters. |

The reason there is a distinction between nonreturnable and returnable delimiters is that some delimiters are just used to split up tokens and are nothing more. Take for example the following string "b c +". Assume you are writing a simple calculator to parse expressions in postfix notation. While both the space and the + separate tokens, you are only interested in the + and not in the space. Indeed, having the space returned as a token would only complicate your code. In this case you would specify + as a returnable, and space as a nonreturnable delimiter.

To use this class, pass an object of it anywhere a TokenizerFunction object is required.

| Parameter | Description |
|---|---|
| Char | The type of the elements within a token, typically char. |
| Traits | The traits class for Char, typically std::char_traits<Char> |
139 |
Revised 146 | 25 147 | December, 2006
148 | 149 |Copyright © 2001 John R. Bandela
150 | 151 |Distributed under the Boost Software License, Version 1.0. (See 152 | accompanying file LICENSE_1_0.txt or 153 | copy at http://www.boost.org/LICENSE_1_0.txt)
155 | 156 | 157 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # [Boost.Tokenizer](https://boost.org/libs/tokenizer) 2 | 3 | Boost.Tokenizer is a part of [Boost C++ Libraries](https://github.com/boostorg). The Boost.Tokenizer package provides a flexible and easy-to-use way to break a string or other character sequence into a series of tokens. 4 | 5 | ## License 6 | 7 | Distributed under the [Boost Software License, Version 1.0](https://www.boost.org/LICENSE_1_0.txt). 8 | 9 | ## Properties 10 | 11 | * C++11 12 | * Header-Only 13 | 14 | ## Build Status 15 | 16 | 17 | | Branch | GHA CI | Appveyor | Coverity Scan | codecov.io | Deps | Docs | Tests | 18 | | :-------------: | ------ | -------- | ------------- | ---------- | ---- | ---- | ----- | 19 | | [`master`](https://github.com/boostorg/tokenizer/tree/master) | [](https://github.com/boostorg/tokenizer/actions?query=branch:master) | [](https://ci.appveyor.com/project/cppalliance/tokenizer/branch/master) | [](https://scan.coverity.com/projects/boostorg-tokenizer) | [](https://codecov.io/gh/boostorg/tokenizer/tree/master) | [](https://pdimov.github.io/boostdep-report/master/tokenizer.html) | [](https://www.boost.org/doc/libs/master/libs/tokenizer) | [](https://www.boost.org/development/tests/master/developer/tokenizer.html) 20 | | [`develop`](https://github.com/boostorg/tokenizer/tree/develop) | [](https://github.com/boostorg/tokenizer/actions?query=branch:develop) | [](https://ci.appveyor.com/project/cppalliance/tokenizer/branch/develop) | [](https://scan.coverity.com/projects/boostorg-tokenizer) | [](https://codecov.io/gh/boostorg/tokenizer/tree/develop) | [](https://pdimov.github.io/boostdep-report/develop/tokenizer.html) | [](https://www.boost.org/doc/libs/develop/libs/tokenizer) | [](https://www.boost.org/development/tests/develop/developer/tokenizer.html) 21 | 22 | ## Overview 23 | 
24 | > break up a phrase into words. 25 | 26 | ![Try it online][badge.wandbox] 27 | 28 | ```c++ 29 | #include![]()
template <
    class TokenizerFunc = char_delimiters_separator<char>,
    class Iterator = std::string::const_iterator,
    class Type = std::string
>
class tokenizer

The tokenizer class provides a container view of a series of tokens contained in a sequence. You set the sequence to parse and the TokenizerFunction to use to parse the sequence either upon construction or using the assign member function. Note: No parsing is actually done upon construction. Parsing is done on demand as the tokens are accessed via the iterator provided by begin.

// simple_example_1.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is, a test";
   tokenizer<> tok(s);
   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

The output from simple_example_1 is:

This
is
a
test

| Parameter | Description |
|---|---|
| TokenizerFunc | The TokenizerFunction used to parse the sequence. |
| Iterator | The type of the iterator that specifies the sequence. |
| Type | The type of the token, typically string. |
91 | 92 |
| Type | Remarks |
|---|---|
| iterator | The type returned by begin and end. Note: the category of iterator will be at most ForwardIterator. It will be InputIterator if the Iterator template parameter is an InputIterator. For any other category, it will be ForwardIterator. |
| const_iterator | Same type as iterator. |
| value_type | Same type as the template parameter Type |
| reference | Same type as value_type& |
| const_reference | Same type as const reference |
| pointer | Same type as value_type* |
| const_pointer | Same type as const pointer |
| size_type | void |
| difference_type | void |
164 | 165 |
tokenizer(Iterator first, Iterator last,const TokenizerFunc& f = TokenizerFunc())

template<class Container>
tokenizer(const Container& c,const TokenizerFunc& f = TokenizerFunc())

void assign(Iterator first, Iterator last)

void assign(Iterator first, Iterator last, const TokenizerFunc& f)

template<class Container>
void assign(const Container& c)

template<class Container>
void assign(const Container& c, const TokenizerFunc& f)

iterator begin() const

iterator end() const

| Parameter | Description |
|---|---|
| c | A container that contains the sequence to parse. Note: c.begin() and c.end() must be convertible to the template parameter Iterator. |
| f | A functor that is a model of TokenizerFunction that will be used to parse the sequence. |
| first | The iterator that represents the beginning position in the sequence to be parsed. |
| last | The iterator that represents the past-the-end position in the sequence to be parsed. |
228 |
Revised 235 | 16 February, 2008
236 | 237 |Copyright © 2001 John R. Bandela
238 | 239 |Distributed under the Boost Software License, Version 1.0. (See 240 | accompanying file LICENSE_1_0.txt or 241 | copy at http://www.boost.org/LICENSE_1_0.txt)
243 | 244 | 245 | -------------------------------------------------------------------------------- /doc/escaped_list_separator.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
escaped_list_separator<Char, Traits = std::char_traits<Char> >

The escaped_list_separator class is an implementation of the TokenizerFunction. The escaped_list_separator parses a superset of the CSV (comma-separated value) format. Examples of this format are below. It is assumed that the default characters for separator, quote, and escape are used.

Field 1,Field 2,Field 3
Field 1,"Field 2, with comma",Field 3
Field 1,Field 2 with \"embedded quote\",Field 3
Field 1, Field 2 with \n new line,Field 3
Field 1, Field 2 with embedded \\ ,Field 3

Fields are normally separated by commas. If you want to put a comma in a field, you need to put quotes around it. Also, three escape sequences are supported:

41 | 42 ||
45 | Escape Sequence 46 | |
47 |
48 |
49 | Result 50 | |
51 |
| <escape><quote> | 55 | 56 |<quote> | 57 |
| <escape>n | 61 | 62 |newline | 63 |
| <escape><escape> | 67 | 68 |<escape> | 69 |
Where <quote> is any character specified to be a quote 73 | and<escape> is any character specified to be an escape character.
74 | 75 |
// simple_example_2.cpp
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "Field 1,\"putting quotes around fields, allows commas\",Field 3";
   tokenizer<escaped_list_separator<char> > tok(s);
   for(tokenizer<escaped_list_separator<char> >::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

93 | 94 | 95 |
escaped_list_separator has 2 constructors. They are as follows:

explicit escaped_list_separator(Char e = '\\', Char c = ',',Char q = '\"')

| Parameter | Description |
|---|---|
| e | Specifies the character to use for escape sequences. It defaults to the C-style \ (backslash). However, you can override it by passing in a different character. An example of when you might want to do this is when you have many fields which are Windows-style filenames. Instead of escaping out each \ in the path, you can change the escape to something else. |
| c | Specifies the character to use to separate the fields. |
| q | Specifies the character to use for the quote. |
138 |
escaped_list_separator(string_type e, string_type c, string_type q)

| Parameter | Description |
|---|---|
| e | Any character in the string e is considered to be an escape character. If an empty string is given, then there are no escape characters. |
| c | Any character in the string c is considered to be a separator. If an empty string is given, then there are no separator characters. |
| q | Any character in the string q is considered to be a quote. If an empty string is given, then there are no quote characters. |
177 | 178 |
To use this class, pass an object of it anywhere in the Tokenizer package where a TokenizerFunction is required.
180 | 181 |182 | 183 |
| Parameter | Description |
|---|---|
| Char | The type of the elements within a token, typically char. |
| Traits | The traits class for the Char type. This is used for comparing Char's. It defaults to std::char_traits<Char> |
208 | 209 |
214 |
Revised 221 | 25 222 | December, 2006
223 | 224 |Copyright © 2001 John R. Bandela
225 | 226 |Distributed under the Boost Software License, Version 1.0. (See 227 | accompanying file LICENSE_1_0.txt or 228 | copy at http://www.boost.org/LICENSE_1_0.txt)
230 | 231 | 232 | -------------------------------------------------------------------------------- /doc/char_separator.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |![]()
The char_separator class breaks a sequence of characters into tokens based on character delimiters much in the same way that strtok() does (but without all the evils of non-reentrancy and destruction of the input sequence).

The char_separator class is used in conjunction with the token_iterator or tokenizer to perform tokenizing.

The strtok() function does not include matches with the character delimiters in the output sequence of tokens. However, sometimes it is useful to have the delimiters show up in the output sequence, therefore char_separator provides this as an option. We refer to delimiters that show up as output tokens as kept delimiters and delimiters that do not show up as output tokens as dropped delimiters.

When two delimiters appear next to each other in the input sequence, there is the question of whether to output an empty token or to skip ahead. The behaviour of strtok() is to skip ahead. The char_separator class provides both options.

This first example shows how to use char_separator as a replacement for the strtok() function. We've specified three character delimiters, and they will not show up as output tokens. We have not specified any kept delimiters, and by default any empty tokens will be ignored.

// char_sep_example_1.cpp
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

int main()
{
    std::string str = ";;Hello|world||-foo--bar;yow;baz|";
    typedef boost::tokenizer<boost::char_separator<char> >
        tokenizer;
    // '-', ';' and '|' are all dropped delimiters
    boost::char_separator<char> sep("-;|");
    tokenizer tokens(str, sep);
    for (tokenizer::iterator tok_iter = tokens.begin();
         tok_iter != tokens.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
    std::cout << "\n";
    return EXIT_SUCCESS;
}

The output is:

<Hello> <world> <foo> <bar> <yow> <baz>

The next example shows tokenizing with two dropped delimiters '-' and ';' and a single kept delimiter '|'. We also specify that empty tokens should show up in the output when two delimiters are next to each other.

// char_sep_example_2.cpp
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

int main()
{
    std::string str = ";;Hello|world||-foo--bar;yow;baz|";
    typedef boost::tokenizer<boost::char_separator<char> >
        tokenizer;
    // '-' and ';' are dropped, '|' is kept, and empty tokens are reported
    boost::char_separator<char> sep("-;", "|", boost::keep_empty_tokens);
    tokenizer tokens(str, sep);
    for (tokenizer::iterator tok_iter = tokens.begin();
         tok_iter != tokens.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
    std::cout << "\n";
    return EXIT_SUCCESS;
}

The output is:

<> <> <Hello> <|> <world> <|> <> <|> <> <foo> <> <bar> <yow> <baz> <|> <>

The final example shows tokenizing on punctuation and whitespace characters using the default constructor of the char_separator.

// char_sep_example_3.cpp
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

int main()
{
    std::string str = "This is, a test";
    typedef boost::tokenizer<boost::char_separator<char> > Tok;
    boost::char_separator<char> sep; // default constructed
    Tok tok(str, sep);
    for (Tok::iterator tok_iter = tok.begin(); tok_iter != tok.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
    std::cout << "\n";
    return EXIT_SUCCESS;
}

The output is:

<This> <is> <,> <a> <test>


| Parameter | Description | Default |
|---|---|---|
| Char | The type of elements within a token, typically char. | |
| Traits | The char_traits for the character type. | std::char_traits<Char> |

explicit char_separator(const Char* dropped_delims,
                        const Char* kept_delims = "",
                        empty_token_policy empty_tokens = drop_empty_tokens)

This creates a char_separator object, which can then be used to create a token_iterator or tokenizer to perform tokenizing. The dropped_delims and kept_delims are strings of characters where each character is used as a delimiter during tokenizing. Whenever a delimiter is seen in the input sequence, the current token is finished, and a new token begins. The delimiters in dropped_delims do not show up as tokens in the output, whereas the delimiters in kept_delims do show up as tokens. If empty_tokens is drop_empty_tokens, then empty tokens will not show up in the output. If empty_tokens is keep_empty_tokens, then empty tokens will show up in the output.

explicit char_separator()

The function std::isspace() is used to identify dropped delimiters and std::ispunct() is used to identify kept delimiters. In addition, empty tokens are dropped.

template <typename InputIterator, typename Token>
bool operator()(InputIterator& next, InputIterator end, Token& tok)

This function is called by the token_iterator to perform tokenizing. The user typically does not call this function directly.

Revised 25 December, 2006

Copyright © 2001-2002 Jeremy Siek and John R. Bandela

Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

--------------------------------------------------------------------------------
/include/boost/token_functions.hpp:
--------------------------------------------------------------------------------
// Boost token_functions.hpp ------------------------------------------------//

// Copyright John R. Bandela 2001.

// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)

// See http://www.boost.org/libs/tokenizer/ for documentation.

// Revision History:
// 01 Oct 2004   Joaquin M Lopez Munoz
//      Workaround for a problem with string::assign in msvc-stlport
// 06 Apr 2004   John Bandela
//      Fixed a bug involving using char_delimiter with a true input iterator
// 28 Nov 2003   Robert Zeh and John Bandela
//      Converted into "fast" functions that avoid using += when
//      the supplied iterator isn't an input_iterator; based on
//      some work done at Archelon and a version that was checked into
//      the boost CVS for a short period of time.
// 20 Feb 2002   John Maddock
//      Removed using namespace std declarations and added
//      workaround for BOOST_NO_STDC_NAMESPACE (the library
//      can be safely mixed with regex).
// 06 Feb 2002   Jeremy Siek
//      Added char_separator.
// 02 Feb 2002   Jeremy Siek
//      Removed tabs and a little cleanup.


#ifndef BOOST_TOKEN_FUNCTIONS_JRB120303_HPP_
#define BOOST_TOKEN_FUNCTIONS_JRB120303_HPP_

#include