├── .DS_Store
├── CC-BY-NC-ND_license.md
├── README-English.md
├── README.md
├── book.png
├── class.png
├── datawhale.jpeg
├── markdowm
│   ├── .DS_Store
│   ├── class.png
│   ├── copy.md
│   └── datawhale.jpeg
├── 第一单元项目集合
│   ├── .DS_Store
│   ├── .ipynb_checkpoints
│   │   ├── 第一章第三节:探索性数据分析--课程-checkpoint.ipynb
│   │   ├── 第一章第三节:探索性数据分析-checkpoint.ipynb
│   │   ├── 第一章:第一节数据载入及初步观察-checkpoint.ipynb
│   │   ├── 第一章:第一节数据载入及初步观察-课程-checkpoint.ipynb
│   │   ├── 第一章:第一节数据载入及初步观察.--课程-checkpoint.ipynb
│   │   ├── 第一章:第三节探索性数据分析-checkpoint.ipynb
│   │   ├── 第一章:第三节探索性数据分析-课程-checkpoint.ipynb
│   │   ├── 第一章:第二节pandas基础--课程-checkpoint.ipynb
│   │   ├── 第一章:第二节pandas基础-checkpoint.ipynb
│   │   └── 第一章:第二节pandas基础-课程-checkpoint.ipynb
│   ├── test_1.csv
│   ├── train.csv
│   ├── train_chinese.csv
│   ├── 第一章:第一节数据载入及初步观察-课程.ipynb
│   ├── 第一章:第一节数据载入及初步观察.ipynb
│   ├── 第一章:第三节探索性数据分析-课程.ipynb
│   ├── 第一章:第三节探索性数据分析.ipynb
│   ├── 第一章:第二节pandas基础-课程.ipynb
│   └── 第一章:第二节pandas基础.ipynb
├── 第三章项目集合
│   ├── .DS_Store
│   ├── .ipynb_checkpoints
│   │   ├── 第三章 模型建立和评估-checkpoint.ipynb
│   │   ├── 第三章模型建立和评估---评价--课程-checkpoint.ipynb
│   │   ├── 第三章模型建立和评估---评价-checkpoint.ipynb
│   │   ├── 第三章模型建立和评估---评价-课程-checkpoint.ipynb
│   │   ├── 第三章模型建立和评估--建模--课程-checkpoint.ipynb
│   │   ├── 第三章模型建立和评估--建模-checkpoint.ipynb
│   │   ├── 第三章模型建立和评估--建模-课程-checkpoint.ipynb
│   │   └── 第三章模型建立和评估-checkpoint.ipynb
│   ├── 1578213400(1).jpg
│   ├── 20170624105439491.png
│   ├── Snipaste_2020-01-05_16-37-56.png
│   ├── Snipaste_2020-01-05_16-38-26.png
│   ├── Snipaste_2020-01-05_16-39-27.png
│   ├── clear_data.csv
│   ├── sklearn.png
│   ├── sklearn01.png
│   ├── train.csv
│   ├── 大纲.md
│   ├── 第三章模型建立和评估---评价-课程.ipynb
│   ├── 第三章模型建立和评估---评价.ipynb
│   ├── 第三章模型建立和评估--建模-课程.ipynb
│   ├── 第三章模型建立和评估--建模.ipynb
│   └── 第三章模型建立和评估.ipynb
└── 第二章项目集合
    ├── .DS_Store
    ├── .ipynb_checkpoints
    │   ├── 第二章:第一节数据清洗及特征处理-checkpoint.ipynb
    │   ├── 第二章:第一节数据清洗及特征处理-课程-checkpoint.ipynb
    │   ├── 第二章:第三节 数据可视化--课程-checkpoint.ipynb
    │   ├── 第二章:第三节数据重构2-checkpoint.ipynb
    │   ├── 第二章:第三节数据重构2-课程-checkpoint.ipynb
    │   ├── 第二章:第二节数据重构1-checkpoint.ipynb
    │   ├── 第二章:第二节数据重构1-课程-checkpoint.ipynb
    │   ├── 第二章:第四节数据可视化-checkpoint.ipynb
    │   └── 第二章:第四节数据可视化-课程-checkpoint.ipynb
    ├── data
    │   ├── .DS_Store
    │   ├── train-left-down.csv
    │   ├── train-left-up.csv
    │   ├── train-right-down.csv
    │   └── train-right-up.csv
    ├── result.csv
    ├── sex_fare_survived.csv
    ├── test_ave.csv
    ├── test_clear.csv
    ├── test_cut.csv
    ├── test_fin.csv
    ├── test_pr.csv
    ├── train.csv
    ├── unit_result.csv
    ├── vis_data
    │   ├── result.csv
    │   ├── sex_fare_survived.csv
    │   └── unit_result.csv
    ├── 第二章:第一节数据清洗及特征处理-课程.ipynb
    ├── 第二章:第一节数据清洗及特征处理.ipynb
    ├── 第二章:第三节数据重构2-课程.ipynb
    ├── 第二章:第三节数据重构2.ipynb
    ├── 第二章:第二节数据重构1-课程.ipynb
    ├── 第二章:第二节数据重构1.ipynb
    ├── 第二章:第四节数据可视化-课程.ipynb
    └── 第二章:第四节数据可视化.ipynb
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/.DS_Store
--------------------------------------------------------------------------------
/CC-BY-NC-ND_license.md:
--------------------------------------------------------------------------------
1 | # CC-BY-NC-ND license
2 |
3 | ## Attribution-NonCommercial-NoDerivs 3.0 United States
4 |
5 |
6 |
7 | > CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM ITS USE.
8 |
9 | ### *License*
10 |
11 | THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
12 |
13 | BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS.
14 |
15 | **1. Definitions**
16 |
17 | 1. **"Collective Work"** means a work, such as a periodical issue, anthology or encyclopedia, in which the Work in its entirety in unmodified form, along with one or more other contributions, constituting separate and independent works in themselves, are assembled into a collective whole. A work that constitutes a Collective Work will not be considered a Derivative Work (as defined below) for the purposes of this License.
18 | 2. **"Derivative Work"** means a work based upon the Work or upon the Work and other pre-existing works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which the Work may be recast, transformed, or adapted, except that a work that constitutes a Collective Work will not be considered a Derivative Work for the purpose of this License. For the avoidance of doubt, where the Work is a musical composition or sound recording, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose of this License.
19 | 3. **"Licensor"** means the individual, individuals, entity or entities that offers the Work under the terms of this License.
20 | 4. **"Original Author"** means the individual, individuals, entity or entities who created the Work.
21 | 5. **"Work"** means the copyrightable work of authorship offered under the terms of this License.
22 | 6. **"You"** means an individual or entity exercising rights under this License who has not previously violated the terms of this License with respect to the Work, or who has received express permission from the Licensor to exercise rights under this License despite a previous violation.
23 |
24 | **2. Fair Use Rights.** Nothing in this license is intended to reduce, limit, or restrict any rights arising from fair use, first sale or other limitations on the exclusive rights of the copyright owner under copyright law or other applicable laws.
25 |
26 | **3. License Grant.** Subject to the terms and conditions of this License, Licensor hereby grants You a worldwide, royalty-free, non-exclusive, perpetual (for the duration of the applicable copyright) license to exercise the rights in the Work as stated below:
27 |
28 | 1. to reproduce the Work, to incorporate the Work into one or more Collective Works, and to reproduce the Work as incorporated in the Collective Works; and,
29 | 2. to distribute copies or phonorecords of, display publicly, perform publicly, and perform publicly by means of a digital audio transmission the Work including as incorporated in Collective Works.
30 |
31 | The above rights may be exercised in all media and formats whether now known or hereafter devised. The above rights include the right to make such modifications as are technically necessary to exercise the rights in other media and formats, but otherwise you have no rights to make Derivative Works. All rights not expressly granted by Licensor are hereby reserved, including but not limited to the rights set forth in Sections 4(d) and 4(e).
32 |
 33 | **4. Restrictions.** The license granted in Section 3 above is expressly made subject to and limited by the following restrictions:
34 |
35 | 1. You may distribute, publicly display, publicly perform, or publicly digitally perform the Work only under the terms of this License, and You must include a copy of, or the Uniform Resource Identifier for, this License with every copy or phonorecord of the Work You distribute, publicly display, publicly perform, or publicly digitally perform. You may not offer or impose any terms on the Work that restrict the terms of this License or the ability of a recipient of the Work to exercise the rights granted to that recipient under the terms of the License. You may not sublicense the Work. You must keep intact all notices that refer to this License and to the disclaimer of warranties. When You distribute, publicly display, publicly perform, or publicly digitally perform the Work, You may not impose any technological measures on the Work that restrict the ability of a recipient of the Work from You to exercise the rights granted to that recipient under the terms of the License. This Section 4(a) applies to the Work as incorporated in a Collective Work, but this does not require the Collective Work apart from the Work itself to be made subject to the terms of this License. If You create a Collective Work, upon notice from any Licensor You must, to the extent practicable, remove from the Collective Work any credit as required by Section 4(c), as requested.
36 | 2. You may not exercise any of the rights granted to You in Section 3 above in any manner that is primarily intended for or directed toward commercial advantage or private monetary compensation. The exchange of the Work for other copyrighted works by means of digital file-sharing or otherwise shall not be considered to be intended for or directed toward commercial advantage or private monetary compensation, provided there is no payment of any monetary compensation in connection with the exchange of copyrighted works.
37 | 3. If You distribute, publicly display, publicly perform, or publicly digitally perform the Work (as defined in Section 1 above) or Collective Works (as defined in Section 1 above), You must, unless a request has been made pursuant to Section 4(a), keep intact all copyright notices for the Work and provide, reasonable to the medium or means You are utilizing: (i) the name of the Original Author (or pseudonym, if applicable) if supplied, and/or (ii) if the Original Author and/or Licensor designate another party or parties (e.g. a sponsor institute, publishing entity, journal) for attribution ("Attribution Parties") in Licensor's copyright notice, terms of service or by other reasonable means, the name of such party or parties; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work. The credit required by this Section 4(c) may be implemented in any reasonable manner; provided, however, that in the case of a Collective Work, at a minimum such credit will appear, if a credit for all contributing authors of the Collective Work appears, then as part of these credits and in a manner at least as prominent as the credits for the other contributing authors. For the avoidance of doubt, You may only use the credit required by this clause for the purpose of attribution in the manner set out above and, by exercising Your rights under this License, You may not implicitly or explicitly assert or imply any connection with, sponsorship or endorsement by the Original Author, Licensor and/or Attribution Parties, as appropriate, of You or Your use of the Work, without the separate, express prior written permission of the Original Author, Licensor and/or Attribution Parties.
38 | 4. For the avoidance of doubt, where the Work is a musical composition:
39 | 1. **Performance Royalties Under Blanket Licenses.** Licensor reserves the exclusive right to collect whether individually or, in the event that Licensor is a member of a performance rights society (e.g. ASCAP, BMI, SESAC), via that society, royalties for the public performance or public digital performance (e.g. webcast) of the Work if that performance is primarily intended for or directed toward commercial advantage or private monetary compensation.
40 | 2. **Mechanical Rights and Statutory Royalties.** Licensor reserves the exclusive right to collect, whether individually or via a music rights agency or designated agent (e.g. Harry Fox Agency), royalties for any phonorecord You create from the Work ("cover version") and distribute, subject to the compulsory license created by 17 USC Section 115 of the US Copyright Act (or the equivalent in other jurisdictions), if Your distribution of such cover version is primarily intended for or directed toward commercial advantage or private monetary compensation.
41 | 5. **Webcasting Rights and Statutory Royalties.** For the avoidance of doubt, where the Work is a sound recording, Licensor reserves the exclusive right to collect, whether individually or via a performance-rights society (e.g. SoundExchange), royalties for the public digital performance (e.g. webcast) of the Work, subject to the compulsory license created by 17 USC Section 114 of the US Copyright Act (or the equivalent in other jurisdictions), if Your public digital performance is primarily intended for or directed toward commercial advantage or private monetary compensation.
42 |
43 | **5. Representations, Warranties and Disclaimer**
44 |
45 | UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR OFFERS THE WORK AS-IS AND ONLY TO THE EXTENT OF ANY RIGHTS HELD IN THE LICENSED WORK BY THE LICENSOR. THE LICENSOR MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MARKETABILITY, MERCHANTIBILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU.
46 |
47 | **6. Limitation on Liability.** EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
48 |
49 | **7. Termination**
50 |
51 | 1. This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License. Individuals or entities who have received Collective Works (as defined in Section 1 above) from You under this License, however, will not have their licenses terminated provided such individuals or entities remain in full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License.
52 | 2. Subject to the above terms and conditions, the license granted here is perpetual (for the duration of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to release the Work under different license terms or to stop distributing the Work at any time; provided, however that any such election will not serve to withdraw this License (or any other license that has been, or is required to be, granted under the terms of this License), and this License will continue in full force and effect unless terminated as stated above.
53 |
54 | **8. Miscellaneous**
55 |
56 | 1. Each time You distribute or publicly digitally perform the Work (as defined in Section 1 above) or a Collective Work (as defined in Section 1 above), the Licensor offers to the recipient a license to the Work on the same terms and conditions as the license granted to You under this License.
57 | 2. If any provision of this License is invalid or unenforceable under applicable law, it shall not affect the validity or enforceability of the remainder of the terms of this License, and without further action by the parties to this agreement, such provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable.
58 | 3. No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.
59 | 4. This License constitutes the entire agreement between the parties with respect to the Work licensed here. There are no understandings, agreements or representations with respect to the Work not specified here. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This License may not be modified without the mutual written agreement of the Licensor and You.
60 |
61 | > ### Creative Commons Notice
62 | >
63 | > Creative Commons is not a party to this License, and makes no warranty whatsoever in connection with the Work. Creative Commons will not be liable to You or any party on any legal theory for any damages whatsoever, including without limitation any general, special, incidental or consequential damages arising in connection to this license. Notwithstanding the foregoing two (2) sentences, if Creative Commons has expressly identified itself as the Licensor hereunder, it shall have all rights and obligations of Licensor.
64 | >
65 | > Except for the limited purpose of indicating to the public that the Work is licensed under the CCPL, Creative Commons does not authorize the use by either party of the trademark "Creative Commons" or any related trademark or logo of Creative Commons without the prior written consent of Creative Commons. Any permitted use will be in compliance with Creative Commons' then-current trademark usage guidelines, as may be published on its website or otherwise made available upon request from time to time. For the avoidance of doubt, this trademark restriction does not form part of this License.
66 | >
67 | > Creative Commons may be contacted at .
68 |
--------------------------------------------------------------------------------
/README-English.md:
--------------------------------------------------------------------------------
1 | # Hands-on data analysis
2 |
3 | ## Motivation
4 |
 5 | **Hands-on data analysis** is Datawhale's open-source project on data analysis. The project began in an earlier Datawhale data analysis course, in which I was a student using the book *Python for Data Analysis* as the teaching material. The book explains pandas and numpy operations very clearly and in detail, but has much less to say about the logic of data analysis. As a result, many learners (myself included) finished it and still did not know what to do when facing a real data analysis problem. That feeling of "I don't know how to use it" is quite understandable: after learning mostly theoretical material, there is a gap between practical application in real life and what the theory taught. Bridging this gap takes your own experimentation and the study of real-world materials.
6 |
 7 | So we asked: what if there were a course organized around a project, with the knowledge points embedded in it, where you learn by doing under guidance, so that the learning sticks better? After finishing such a course, you would master pandas and the general experience of the data analysis process. Our research suggested that no existing project about **data analysis** on the market fully met these criteria. So Datawhale's members teamed up to make an open-source course that accomplishes the small goals above, so that every learner who uses our course can get a better start on their data analysis journey.
 8 |
 9 | The course has now been updated to version 1.3: we have improved the learning process and provided better explanations of the answers. Supporting materials will be released gradually. We still want to start from basic data analysis operations and the data analysis workflow, and to introduce real-world examples in each module. After that we will keep adding new content (such as data mining algorithms). This is an open-source project: we will keep iterating, and everyone can participate and work on it together.
 10 |
 11 | About the name of our project, *hands-on data analysis*: data analysis is a process of seeing the truth in a pile of numbers. Learning to manipulate data is only half of the skill; the other half is the experience inside your brain. So during the course, think more, summarize more, and above all write real code. We hope that as you learn you reason more and keep asking why, and practice more so that theory and practice are combined. By the end of the course, you will come away with a great deal.
12 |
 13 | #### Companion materials
 14 |
 15 | Since this is a course born at Datawhale, it is best learned together with the other resources Datawhale provides. The code we provide comes as Jupyter notebooks containing the tasks you have to complete, along with the hints and guidance we give you. Combined with Datawhale's [group learning](https://github.com/datawhalechina/team-learning-data-mining), where you can discuss with everyone and pool materials together, the learning effect will certainly be doubled. Also, Datawhale previously open-sourced a pandas tutorial, [Joyful-Pandas](https://github.com/datawhalechina/joyful-pandas), which lays out the logic of pandas together with code demonstrations; for the pandas operations in our data analysis course, you can refer to *Joyful-Pandas* to make your learning more rewarding.
16 |
17 |
18 | ## Project scheduling
19 |
 20 | #### Schedule
 21 |
 22 | The course is divided into three units, roughly: basic data operations; data cleaning and reconstruction; modeling and evaluation:
 23 |
 24 | 1. Part 1: Given a dataset to analyze, we learn how to load and inspect it, then pick up some basic pandas operations, and finally start trying exploratory data analysis.
 25 | 2. Part 2: Once we can manipulate and understand the data fairly fluently, we begin data cleaning and reconstruction, turning the raw data into a clean, usable dataset in preparation for feeding it into a model later.
 26 | 3. Part 3: Depending on the task, we decide which model to build, using the popular sklearn library. A model's quality has to be evaluated, so afterwards we evaluate our model and optimize it. (A minimal end-to-end sketch follows below.)
27 |
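For orientation, here is a minimal sketch of what the three parts look like in code. It is our own illustration rather than a course answer, and it assumes the Titanic `train.csv` from this repository sits in the working directory:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Part 1: load and inspect the data
df = pd.read_csv('train.csv')
df.info()

# Part 2: minimal cleaning and feature processing
df['Age'] = df['Age'].fillna(df['Age'].median())   # fill missing ages
df['Sex'] = (df['Sex'] == 'male').astype(int)      # encode sex as 0/1
X, y = df[['Pclass', 'Sex', 'Age', 'Fare']], df['Survived']

# Part 3: build and evaluate a model
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))
```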
28 | | Chapter | Summary |
29 | | :-------- | :---------------------------------------: |
30 | | Chapter 1 | Data loading and preliminary observations |
31 | | | Pandas basics explained |
32 | | | Exploratory Data Analysis |
33 | | Chapter 2 | Data cleaning and feature processing |
34 | | | Data Reconstruction 1 |
35 | | | Data Reconstruction 2 |
36 | | | Data Visualization |
37 | | Chapter 3 | Data Modeling |
38 | | | Model Evaluation |
39 |
40 | #### How to learn
41 |
 42 | Our code is in Jupyter form, and each part of the course is divided into two versions: **course** and **answers**. During the learning period, work in the course notebooks: do all the learning there, find information by yourself, and complete the code, the thinking prompts, and your own insights. Afterwards you can discuss with your buddies and share materials and insights. As for the answers: data analysis itself is open-ended, so the answers are open-ended too, and we hope above all that you reach your own understanding and answers. If you need a reference, we provide the answers we wrote in the **answers** notebooks.
43 |
44 |
45 |
 46 | (Course section: write the code yourself, following the task requirements)
47 |
48 |
49 |
50 |
51 |
52 | ## Feedback
53 |
54 | Feedback from learners of previous versions
55 |
 56 | > As a learner with no background, I found this round of hands-on data analysis very comfortable. The tutorials are relatively simple and clear, and the whole course flowed smoothly. I read each task's tutorial twice: the first time I only followed the tutorial, and afterwards chewed through the book *Python for Data Analysis*. The assignments were great in terms of extensions, which I really liked. The second time through the tutorial, I finished the homework and reflection without looking at the answers at all. I came away with a great sense of accomplishment and really learned a lot. As an introductory data analysis course, this is excellent!
57 | >
58 | > --------Danfei Wu, North China Electric Power University
59 |
60 |
61 |
 62 | > First of all, this learning document is very well made and very guiding. I like the project's way of learning: active learning, and searching when you don't understand.
63 | >
64 | >-------- Li Qingqing
65 |
66 | >
67 | >
 68 | >It helped a lot. After finishing the program, I still use the skills from the course in my real job. I hope a later version of the course will include a section on the logic of data analysis.
69 | >
70 | >--------Version V1.0 Group Study Participants
71 |
 72 | **Outstanding assignments by Liu Chuchu, an excellent student**: https://space.bilibili.com/621981283/channel/detail?cid=191222
 73 |
 74 | (You are welcome to watch the videos explaining all the assignments.)
75 |
76 | ## Improvement methods
77 |
 78 | If you don't find what you want in Hands-on Data Analysis, or if you find an error in the project, please don't hesitate to give feedback in our GitHub Issues. We will usually reply within 24 hours; if you have not heard back after 24 hours, you can contact me by email (chenands@qq.com).
79 |
80 | ## Contributors
81 |
82 | **Project leader**
83 |
 84 | Andong Chen: Datawhale member, Hunan University | Queen Mary University of London
85 |
86 | **Core contributors**
87 |
88 | Juanjuan Jin: Datawhale member, Master of Zhejiang University
89 |
 90 | Yang Jiada: Datawhale member, data mining engineer
 91 |
 92 | Laobiao (老表): Datawhale member, author of the WeChat official account *Jianshuo Python* (简说Python)
93 |
94 | **Contributor**
95 |
96 | Hongxing: Datawhale member, data analyst
97 |
98 | Li Ling: Datawhale member, algorithm engineer
99 |
100 | Gao Liye: Datawhale member, graduate student of Taiyuan University of Technology
101 |
102 | Zhang Wentao: Datawhale member, PhD student at Sun Yat-sen University
103 |
104 | ## Follow us
105 |
106 |
Scan the QR code below and reply with the keyword "动手学数据分析" to join the "Project Exchange Group".
107 |

108 |
109 |
110 |
111 |
112 | ## LICENSE
113 |
114 |
115 | Copyright License: CC-BY-NC-ND license
116 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Hands-on data analysis
2 |
3 | # 动手学数据分析
4 |
 5 | ## 🎉 This course is now available on Zhihai (智海), a national AI education platform
 6 |
 7 | Course page: https://aiplusx.momodel.cn/classroom/class/664b498628cf753a70eb516e?activeKey=intro
8 |
9 |
 10 | ## Motivation
 11 |
 12 | Hands-on data analysis is Datawhale's open-source project on data analysis. It began in an earlier Datawhale data analysis course, in which I was a student using *Python for Data Analysis* as the textbook and learning by working through its code. The book covers pandas and numpy operations in great detail, but says far less about the logic of data analysis itself. So many learners, myself included, finished it and found we had typed a pile of code without knowing what it was for. That feeling of "not knowing how to use it" is easy to understand: after studying fairly theoretical material, there is a sizable gap between real-life application and what the theory taught. Closing that gap takes your own experimentation and hands-on practice materials.
 13 |
 14 | So we asked: is there a course that takes a project as its main line, embeds the knowledge points in it, and teaches by learning, doing, and being guided, so that by the end you have mastered both pandas and the general approach and workflow of data analysis? Our survey found no **data analysis** project on the market that fully met this standard (disappointing!). So Datawhale members hit it off and built such an open-source course together, to achieve the small goals above and give everyone who uses it a better start on their data analysis journey.
 15 |
 16 | The course has now been updated to version 1.3: we have improved the learning flow and provided better answer explanations. Companion textbook materials will be released step by step. We still start from basic data analysis operations and the analysis workflow, introduce hands-on examples in every module, and will keep adding new content (such as data mining algorithms). This is an open-source project: we will keep iterating, with everyone participating and working on it together.
 17 |
 18 | About the project's name, Hands-on data analysis (动手学数据分析): data analysis is the process of seeing the truth in a pile of numbers. Knowing how to manipulate data is only half the skill; the other half is in the brain. Think more, summarize more, and above all write real code. So while taking this course, reason more and keep asking why; practice more, and make sure theory and practice come together. By the end of the course you will certainly have gained a great deal.
19 |
20 |
 21 | #### Companion materials
 22 |
 23 | Since this course was born at Datawhale, it works best together with Datawhale's other resources. Our code comes as Jupyter notebooks that contain the tasks you need to complete, plus our hints and guidance. Combined with Datawhale's [group learning](https://github.com/datawhalechina/team-learning-data-mining), where you can discuss with others and pool materials, the learning effect is sure to double. Datawhale has also open-sourced a pandas tutorial, [Joyful-Pandas](https://github.com/datawhalechina/joyful-pandas), which lays out the logic of pandas along with code demonstrations; for the pandas operations in this course you can refer to *Joyful-Pandas* to get twice the result with half the effort.
24 |
25 |
 26 | ## Project structure and how to use it
 27 |
 28 | #### Structure
 29 |
 30 | The course is divided into three units, roughly: basic data operations; data cleaning and reconstruction; modeling and evaluation.
 31 |
 32 | 1. Part 1: Given a dataset to analyze, we learn how to load and inspect it, then pick up some basic pandas operations, and finally start trying exploratory data analysis.
 33 | 2. Part 2: Once we can manipulate and understand the data fairly fluently, we begin data cleaning and reconstruction, turning the raw data into a clean, usable dataset in preparation for feeding it into a model later.
 34 | 3. Part 3: Depending on the task, we decide which model to build, using the popular sklearn library. A model's quality has to be evaluated, so afterwards we evaluate our model and optimize it.
 35 |
 36 | | Chapter | Sections |
 37 | | :----- | :----------------: |
 38 | | Chapter 1 | Data loading and initial observation |
 39 | | | pandas basics |
 40 | | | Exploratory data analysis |
 41 | | Chapter 2 | Data cleaning and feature processing |
 42 | | | Data reconstruction 1 |
 43 | | | Data reconstruction 2 |
 44 | | | Data visualization |
 45 | | Chapter 3 | Modeling |
 46 | | | Model evaluation |
 47 |
 48 | #### How to use it
 49 |
 50 | All our code is in Jupyter form, and each part of the course comes in two versions: **course** and **answers**. While learning, work in the course notebooks: do all the learning there, look up materials yourself, and complete the code, the thinking prompts, and your own takeaways. Afterwards you can discuss with your study partners and share materials and insights. You may consult the answers, but since data analysis is inherently open-ended, the answers are open-ended too; above all we hope you form your own understanding and your own answers. If you need a reference, the **answers** notebooks contain the answers we wrote.
51 |
52 |
53 |
 54 | (Course version: write the code yourself, following the task requirements)
55 |
56 |
57 |
 58 | (Reference answers: if you get stuck, consult the answers we provide)
59 |
 60 | ## Feedback
 61 |
 62 | Feedback from learners of previous versions
 63 |
 64 | > As a beginner with no background, I found this round of Hands-on Data Analysis very comfortable to learn. The tutorials are fairly simple and clear, and the whole course flows smoothly. For each task I read the tutorial twice: the first time I only followed the tutorial's line of thought, then chewed through *Python for Data Analysis*, taking notes as I went (and writing it up on CSDN); the assignments' extensions deserve extra points here. The second time through the tutorial, I finished the homework and thinking prompts without looking at the answers at all. I came away with a real sense of accomplishment and genuinely learned a lot. As an introductory data analysis course, this is fantastic!
65 | >
 66 | > --------Wu Danfei, North China Electric Power University
67 |
68 |
69 |
 70 | > First of all, the learning documents are very well made and very guiding; the project's way of learning (active learning: search and ask when you don't understand) is one of the better approaches I have seen in projects.
 71 | > For a learner who already has a Python data analysis background, it is great for reviewing, improving, and consolidating. The project is comparatively less close to everyday life. I had touched on the modeling part before, but since I haven't figured it out I have nothing to suggest there.
 72 | >
 73 | > -------- Li Qingqing
74 |
75 | >
76 | >
 77 | >It helped quite a bit: working on projects afterwards, I found myself still using these techniques constantly, so it is very useful. I think some material on the reasoning process of data analysis could be added.
 78 | >
 79 | >--------Participants in the V1.0 group study
80 |
 81 | ## Improvement methods
 82 |
 83 | If Hands-on Data Analysis lacks something you want, or you find an error anywhere in the project, please don't hesitate to give feedback in our GitHub Issues: say which part your question concerns, then submit the content you would like added or the correction. We usually reply within 24 hours; if there is no reply after 24 hours, you can reach me by email (chenands@qq.com).
84 |
 85 | ## Contributors
 86 |
 87 | **Project leader**
 88 |
 89 | Chen Andong: Datawhale member, Harbin Institute of Technology | Queen Mary University of London (project leader)
 90 |
 91 | **Core contributors**
 92 |
 93 | Jin Juanjuan: Datawhale member, master's graduate of Zhejiang University
 94 |
 95 | Yang Jiada: Datawhale member, data mining engineer
 96 |
 97 | Laobiao (老表): Datawhale member, author of the WeChat official account Jianshuo Python (简说Python)
 98 |
 99 |
 100 |
 101 | **Contributors**
 102 |
 103 | Hongxing: Datawhale member, data analyst
 104 |
 105 | Li Ling: Datawhale member, algorithm engineer
 106 |
 107 | Gao Liye: Datawhale member, graduate student at Taiyuan University of Technology
 108 |
 109 | Zhang Wentao: Datawhale member, PhD student at Sun Yat-sen University
110 |
111 |
112 |
 113 | ## Follow us
114 |
115 |
Scan the QR code below and reply with the keyword "动手学数据分析" to join the project discussion group.
116 |

117 |
118 |
119 |
120 | ## LICENSE
121 |
122 |
 123 | This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
124 |
125 |
--------------------------------------------------------------------------------
/book.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/book.png
--------------------------------------------------------------------------------
/class.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/class.png
--------------------------------------------------------------------------------
/datawhale.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/datawhale.jpeg
--------------------------------------------------------------------------------
/markdowm/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/markdowm/.DS_Store
--------------------------------------------------------------------------------
/markdowm/class.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/markdowm/class.png
--------------------------------------------------------------------------------
/markdowm/copy.md:
--------------------------------------------------------------------------------
1 | # Hands-on data analysis
2 |
3 | # 动手学数据分析
4 |
 5 | ## Motivation
 6 |
 7 | Hands-on data analysis is Datawhale's open-source project on data analysis. It began in an earlier Datawhale data analysis course, in which I was a student using *Python for Data Analysis* as the textbook and learning by working through its code. The book covers pandas and numpy operations in great detail, but says far less about the logic of data analysis itself. So many learners, myself included, finished it and found we had typed a pile of code without knowing what it was for. That feeling of "not knowing how to use it" is easy to understand: after studying fairly theoretical material, there is a sizable gap between real-life application and what the theory taught. Closing that gap takes your own experimentation and hands-on practice materials.
 8 |
 9 | So we asked: is there a course that takes a project as its main line, embeds the knowledge points in it, and teaches by learning, doing, and being guided, so that by the end you have mastered both pandas and the general approach and workflow of data analysis? Our survey found no **data analysis** project on the market that fully met this standard (disappointing!). So Datawhale members hit it off and built such an open-source course together, to achieve the small goals above and give everyone who uses it a better start on their data analysis journey.
 10 |
 11 | The course has now been updated to version 1.3: we have improved the learning flow and provided better answer explanations. Companion textbook materials will be released step by step. We still start from basic data analysis operations and the analysis workflow, introduce hands-on examples in every module, and will keep adding new content (such as data mining algorithms). This is an open-source project: we will keep iterating, with everyone participating and working on it together.
 12 |
 13 | About the project's name, Hands-on data analysis (动手学数据分析): data analysis is the process of seeing the truth in a pile of numbers. Knowing how to manipulate data is only half the skill; the other half is in the brain. Think more, summarize more, and above all write real code. So while taking this course, reason more and keep asking why; practice more, and make sure theory and practice come together. By the end of the course you will certainly have gained a great deal.
14 |
 15 | #### Companion materials
 16 |
 17 | Since this course was born at Datawhale, it works best together with Datawhale's other resources. Our code comes as Jupyter notebooks that contain the tasks you need to complete, plus our hints and guidance. Combined with Datawhale's [group learning](https://github.com/datawhalechina/team-learning), where you can discuss with others and pool materials, the learning effect is sure to double. Datawhale has also open-sourced a pandas tutorial, [Joyful-Pandas](https://github.com/datawhalechina/joyful-pandas), which lays out the logic of pandas along with code demonstrations; for the pandas operations in this course you can refer to *Joyful-Pandas* to get twice the result with half the effort.
18 |
19 |
 20 | ## Project structure and how to use it
 21 |
 22 | #### Structure
 23 |
 24 | The course is divided into three units, roughly: basic data operations; data cleaning and reconstruction; modeling and evaluation.
 25 |
 26 | 1. Part 1: Given a dataset to analyze, we learn how to load and inspect it, then pick up some basic pandas operations, and finally start trying exploratory data analysis.
 27 | 2. Part 2: Once we can manipulate and understand the data fairly fluently, we begin data cleaning and reconstruction, turning the raw data into a clean, usable dataset in preparation for feeding it into a model later.
 28 | 3. Part 3: Depending on the task, we decide which model to build, using the popular sklearn library. A model's quality has to be evaluated, so afterwards we evaluate our model and optimize it.
 29 |
 30 | | Chapter | Sections |
 31 | | :----- | :----------------: |
 32 | | Chapter 1 | Data loading and initial observation |
 33 | | | pandas basics |
 34 | | | Exploratory data analysis |
 35 | | Chapter 2 | Data cleaning and feature processing |
 36 | | | Data reconstruction 1 |
 37 | | | Data reconstruction 2 |
 38 | | | Data visualization |
 39 | | Chapter 3 | Modeling |
 40 | | | Model evaluation |
 41 |
 42 | #### How to use it
 43 |
 44 | All our code is in Jupyter form, and each part of the course comes in two versions: **course** and **answers**. While learning, work in the course notebooks: do all the learning there, look up materials yourself, and complete the code, the thinking prompts, and your own takeaways. Afterwards you can discuss with your study partners and share materials and insights. You may consult the answers, but since data analysis is inherently open-ended, the answers are open-ended too; above all we hope you form your own understanding and your own answers. If you need a reference, the **answers** notebooks contain the answers we wrote.
45 |
46 |
47 |
 48 | (Course version: write the code yourself, following the task requirements)
49 |
50 |
51 |
 52 | (Reference answers: if you get stuck, consult the answers we provide)
53 |
 54 | ## Feedback
 55 |
 56 | Feedback from learners of previous versions
 57 |
 58 | > As a beginner with no background, I found this round of Hands-on Data Analysis very comfortable to learn. The tutorials are fairly simple and clear, and the whole course flows smoothly. For each task I read the tutorial twice: the first time I only followed the tutorial's line of thought, then chewed through *Python for Data Analysis*, taking notes as I went (and writing it up on CSDN); the assignments' extensions deserve extra points here. The second time through the tutorial, I finished the homework and thinking prompts without looking at the answers at all. I came away with a real sense of accomplishment and genuinely learned a lot. As an introductory data analysis course, this is fantastic!
59 | >
 60 | > --------Wu Danfei, North China Electric Power University
61 |
62 |
63 |
 64 | > First of all, the learning documents are very well made and very guiding; the project's way of learning (active learning: search and ask when you don't understand) is one of the better approaches I have seen in projects.
 65 | > For a learner who already has a Python data analysis background, it is great for reviewing, improving, and consolidating. The project is comparatively less close to everyday life. I had touched on the modeling part before, but since I haven't figured it out I have nothing to suggest there.
 66 | >
 67 | > -------- Li Qingqing
68 |
69 | >
70 | >
 71 | >It helped quite a bit: working on projects afterwards, I found myself still using these techniques constantly, so it is very useful. I think some material on the reasoning process of data analysis could be added.
 72 | >
 73 | >--------Participants in the V1.0 group study
74 |
 75 | ## Improvement methods
 76 |
 77 | If Hands-on Data Analysis lacks something you want, or you find an error anywhere in the project, please don't hesitate to give feedback in our GitHub Issues: say which part your question concerns, then submit the content you would like added or the correction. We usually reply within 24 hours; if there is no reply after 24 hours, you can reach me by email (chenands@qq.com).
78 |
 79 | ## Contributors
 80 |
 81 | **Core contributors**
 82 |
 83 | Chen Andong: Datawhale member, Minzu University of China | Queen Mary University of London (project leader)
 84 |
 85 | **Contributors**
 86 |
 87 | Jin Juanjuan: Datawhale member, master's graduate of Zhejiang University
 88 |
 89 | Yang Jiada: Datawhale member, data mining engineer
 90 |
 91 | Laobiao (老表): Datawhale member, author of the WeChat official account Jianshuo Python (简说Python)
 92 |
 93 |
 94 |
 95 | **Group-study contributors**
 96 |
 97 | Hongxing: Datawhale member, data analyst
 98 |
 99 | Li Ling: Datawhale member, algorithm engineer
 100 |
 101 | Gao Liye: Datawhale member, graduate student at Taiyuan University of Technology
 102 |
 103 | Zhang Wentao: Datawhale member, PhD student at Sun Yat-sen University
--------------------------------------------------------------------------------
/markdowm/datawhale.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/markdowm/datawhale.jpeg
--------------------------------------------------------------------------------
/第一单元项目集合/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第一单元项目集合/.DS_Store
--------------------------------------------------------------------------------
/第一单元项目集合/.ipynb_checkpoints/第一章:第一节数据载入及初步观察-课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习**:这门课程得主要目的是通过真实的数据,以实战的方式了解数据分析的流程和熟悉数据分析python的基本操作。知道了课程的目的之后,我们接下来我们要正式的开始数据分析的实战教学,完成kaggle上[泰坦尼克的任务](https://www.kaggle.com/c/titanic/overview),实战数据分析全流程。\n",
8 | "这里有两份资料:\n",
9 | "教材《Python for Data Analysis》和 baidu.com &\n",
10 | "google.com(善用搜索引擎)"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
 17 |     "## 1 Chapter 1: Data loading and initial observation"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
 24 |     "### 1.1 Loading the data\n",
 25 |     "Dataset download: https://www.kaggle.com/c/titanic/overview"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
 32 |     "#### 1.1.1 Task 1: Import numpy and pandas"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
 41 |     "#Write your code here\n",
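    "# One possible answer (this exercise is open-ended):\n",
    "import numpy as np\n",
    "import pandas as pd\n",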
42 | "\n"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
 49 |     "[Hint] If the import fails, learn how to install the numpy and pandas libraries in your Python environment"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
 56 |     "#### 1.1.2 Task 2: Load the data\n",
 57 |     "(1) Load the data using a relative path \n",
 58 |     "(2) Load the data using an absolute path"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 1,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
 67 |     "#Write your code here\n",
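    "# One possible answer (assumes train.csv sits next to this notebook):\n",
    "df = pd.read_csv('train.csv')  # relative path\n",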
68 | "\n"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 1,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
 77 |     "#Write your code here\n",
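    "# One possible answer; the absolute path is built from the current\n",
    "# working directory, so it adapts to your machine:\n",
    "import os\n",
    "df = pd.read_csv(os.path.join(os.getcwd(), 'train.csv'))\n",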
78 | "\n"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
 85 |     "[Hint] If loading via a relative path fails, try os.getcwd() to check the current working directory. \n",
 86 |     "[Think] Now that you know how to load data, try out how pd.read_csv() and pd.read_table() differ; what do you have to do to make them behave the same? Also look into how '.tsv' and '.csv' differ, and how to load each of these datasets. \n",
 87 |     "[Summary] Loading the data is the first step of all the work. We will deal with different data formats (e.g. .csv, .tsv, .xlsx), but the loading method and approach are the same. In future work and projects, when you meet a problem you have not seen before, look things up (use Google), understand the business logic, and be clear about what the input and output are."
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
 94 |     "#### 1.1.3 Task 3: Read the file in chunks of 1000 rows"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 1,
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
 103 |     "#Write your code here\n",
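    "# One possible answer: chunksize makes read_csv return an iterator of DataFrames:\n",
    "chunker = pd.read_csv('train.csv', chunksize=1000)\n",
    "for chunk in chunker:\n",
    "    print(len(chunk))  # each chunk is a DataFrame of up to 1000 rows\n",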
104 | "\n"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
 111 |     "[Think] What is chunked reading? Why read in chunks?\n",
 112 |     "\n",
 113 |     "[Hint] What type is the chunker (data chunk)? Print it with a `for` loop to see what it actually looks like."
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
 120 |     "#### 1.1.4 Task 4: Change the headers to Chinese and the index to 乘客ID [for some English data, translating it can make the data more intuitive to work with]\n",
121 | "PassengerId => 乘客ID \n",
122 | "Survived => 是否幸存 \n",
123 | "Pclass => 乘客等级(1/2/3等舱位) \n",
124 | "Name => 乘客姓名 \n",
125 | "Sex => 性别 \n",
126 | "Age => 年龄 \n",
127 | "SibSp => 堂兄弟/妹个数 \n",
128 | "Parch => 父母与小孩个数 \n",
129 | "Ticket => 船票信息 \n",
130 | "Fare => 票价 \n",
131 | "Cabin => 客舱 \n",
132 | "Embarked => 登船港口 "
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 1,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
 141 |     "#Write your code here\n",
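    "# One possible answer, applying the mapping above while reading\n",
    "# (header=0 drops the original English header row):\n",
    "df = pd.read_csv('train.csv', header=0, index_col='乘客ID',\n",
    "                 names=['乘客ID', '是否幸存', '乘客等级(1/2/3等舱位)', '乘客姓名', '性别', '年龄',\n",
    "                        '堂兄弟/妹个数', '父母与小孩个数', '船票信息', '票价', '客舱', '登船港口'])\n",
    "df.head()\n",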
142 | "\n"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
 149 |     "[Think] One way to change the headers to Chinese is to replace the English column headers with Chinese ones. Are there other methods?"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
 156 |     "### 1.2 Initial observation\n",
 157 |     "After loading the data, you may want an overview of the data's overall structure and a sample of it: for example its size, how many columns it has, what format each column is, whether it contains nulls, and so on"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
 164 |     "#### 1.2.1 Task 1: View the basic information of the data"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 1,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
 173 |     "#Write your code here\n",
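    "# One possible answer: .info() shows the shape, dtypes and non-null counts:\n",
    "df.info()\n",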
174 | "\n"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
 181 |     "[Hint] Several functions can do this; try making a summary of them"
182 | ]
183 | },
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
 188 |     "#### 1.2.2 Task 2: Look at the first 10 rows and the last 15 rows of the table"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 1,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": [
 197 |     "#Write your code here\n",
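    "# One possible answer:\n",
    "df.head(10)\n",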
198 | "\n"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": 1,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
 207 |     "#Write your code here\n",
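    "# One possible answer:\n",
    "df.tail(15)\n",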
208 | "\n"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
 215 |     "#### 1.2.3 Task 3: Check whether the data has nulls: return True where a value is missing and False elsewhere"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 1,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
 224 |     "#Write your code here\n",
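    "# One possible answer: True marks a missing value:\n",
    "df.isnull().head()\n",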
225 | "\n"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
 232 |     "[Summary] All the operations above are ways of observing the data itself in data analysis"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
 239 |     "[Think] From what other angles can a dataset be observed? Look for answers; this will help a lot with the analysis below"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
 246 |     "### 1.3 Saving the data"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
 253 |     "#### 1.3.1 Task 1: Save the data you loaded and modified as a new file, train_chinese.csv, in the working directory"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 1,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
 262 |     "#Write your code here\n",
 263 |     "# Note: depending on the operating system, the saved file may show garbled characters; you can pass `encoding='GBK'` or `encoding='utf-8'`\n",
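    "# One possible answer:\n",
    "df.to_csv('train_chinese.csv', encoding='utf-8')\n"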
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
 270 |     "[Summary] That covers loading the data and getting started. Next we move on to computation on the data itself; we will mainly master how numpy and pandas are used in work and project settings."
271 | ]
272 | }
273 | ],
274 | "metadata": {
275 | "kernelspec": {
276 | "display_name": "Python 3",
277 | "language": "python",
278 | "name": "python3"
279 | },
280 | "language_info": {
281 | "codemirror_mode": {
282 | "name": "ipython",
283 | "version": 3
284 | },
285 | "file_extension": ".py",
286 | "mimetype": "text/x-python",
287 | "name": "python",
288 | "nbconvert_exporter": "python",
289 | "pygments_lexer": "ipython3",
290 | "version": "3.8.10"
291 | },
292 | "toc": {
293 | "base_numbering": 1,
294 | "nav_menu": {},
295 | "number_sections": false,
296 | "sideBar": true,
297 | "skip_h1_title": false,
298 | "title_cell": "Table of Contents",
299 | "title_sidebar": "Contents",
300 | "toc_cell": false,
301 | "toc_position": {
302 | "height": "calc(100% - 180px)",
303 | "left": "10px",
304 | "top": "150px",
305 | "width": "582px"
306 | },
307 | "toc_section_display": true,
308 | "toc_window_display": true
309 | }
310 | },
311 | "nbformat": 4,
312 | "nbformat_minor": 4
313 | }
314 |
--------------------------------------------------------------------------------
/第一单元项目集合/.ipynb_checkpoints/第一章:第一节数据载入及初步观察.--课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
 7 |     "**Review:** The main goal of this course is to learn the data analysis workflow through real data, hands-on, and to get familiar with the basic Python operations for data analysis. With that goal in mind, we now formally begin the hands-on teaching and complete the Kaggle [Titanic task](https://www.kaggle.com/c/titanic/overview), practicing the full data analysis workflow.\n",
 8 |     "Two resources to keep at hand:\n",
 9 |     "the textbook *Python for Data Analysis*, chapter 6, plus baidu.com &\n",
 10 |     "google.com (make good use of search engines)"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
 17 |     "## 1 Chapter 1: Data loading and initial observation"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
 24 |     "### 1.1 Loading the data\n",
 25 |     "Dataset download: https://www.kaggle.com/c/titanic/overview"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
 32 |     "#### 1.1.1 Task 1: Import numpy and pandas"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 30,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "import numpy as np\n",
42 | "import pandas as pd"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
 49 |     "[Hint] If the import fails, learn how to install the numpy and pandas libraries in your Python environment"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
 56 |     "#### 1.1.2 Task 2: Load the data\n",
 57 |     "(1) Load the file 'train.csv' using a relative path\n",
 58 |     "(2) Load the file 'train.csv' using an absolute path"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": []
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": []
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "【提示】相对路径载入报错时,尝试使用os.getcwd()查看当前工作目录。 \n",
80 | "【思考】知道数据加载的方法后,试试pd.read_csv()和pd.read_table()的不同,如果想让他们效果一样,需要怎么做?了解一下'.tsv'和'.csv'的不同,如何加载这两个数据集? \n",
81 | "【总结】加载的数据是所有工作的第一步,我们的工作会接触到不同的数据格式(eg:.csv;.tsv;.xlsx),但是加载的方法和思路都是一样的,在以后工作和做项目的过程中,遇到之前没有碰到的问题,要多多查资料吗,使用googel,了解业务逻辑,明白输入和输出是什么。"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
 88 |     "#### 1.1.3 Task 3: Read the file in chunks of 1000 rows"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": []
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
 102 |     "[Think] What is chunked reading? Why read in chunks?"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
 109 |     "#### 1.1.4 Task 4: Change the headers to Chinese and the index to 乘客ID [for some English data, translating it can make the data more intuitive to work with]\n",
110 | "PassengerId => 乘客ID \n",
111 | "Survived => 是否幸存 \n",
112 | "Pclass => 乘客等级(1/2/3等舱位) \n",
113 | "Name => 乘客姓名 \n",
114 | "Sex => 性别 \n",
115 | "Age => 年龄 \n",
116 | "SibSp => 堂兄弟/妹个数 \n",
117 | "Parch => 父母与小孩个数 \n",
118 | "Ticket => 船票信息 \n",
119 | "Fare => 票价 \n",
120 | "Cabin => 客舱 \n",
121 | "Embarked => 登船港口 "
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": []
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
 135 |     "[Think] One way to change the headers to Chinese is to replace the English column headers with Chinese ones. Are there other methods?"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
 142 |     "### 1.2 Initial observation\n",
 143 |     "After loading the data, you may want an overview of the data's overall structure and a sample of it: for example its size, how many columns it has, what format each column is, whether it contains nulls, and so on"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
 150 |     "#### 1.2.1 Task 1: View the basic information of the train.csv data"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": []
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
 164 |     "#### 1.2.2 Task 2: Look at the first 10 rows and the last 15 rows of the table"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": null,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": []
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "metadata": {},
178 | "outputs": [],
179 | "source": []
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
 185 |     "#### 1.2.3 Task 3: Check whether the data has nulls: return True where a value is missing and False elsewhere"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 33,
192 | "metadata": {},
193 | "outputs": [
194 | {
195 | "data": {
196 | "text/html": [
197 | "\n",
198 | "\n",
211 | "
\n",
212 | " \n",
213 | " \n",
214 | " | \n",
215 | " 是否幸存 | \n",
216 | " 仓位等级 | \n",
217 | " 姓名 | \n",
218 | " 性别 | \n",
219 | " 年龄 | \n",
220 | " 兄弟姐妹个数 | \n",
221 | " 父母子女个数 | \n",
222 | " 船票信息 | \n",
223 | " 票价 | \n",
224 | " 客舱 | \n",
225 | " 登船港口 | \n",
226 | "
\n",
227 | " \n",
228 | " 乘客ID | \n",
229 | " | \n",
230 | " | \n",
231 | " | \n",
232 | " | \n",
233 | " | \n",
234 | " | \n",
235 | " | \n",
236 | " | \n",
237 | " | \n",
238 | " | \n",
239 | " | \n",
240 | "
\n",
241 | " \n",
242 | " \n",
243 | " \n",
244 | " PassengerId | \n",
245 | " False | \n",
246 | " False | \n",
247 | " False | \n",
248 | " False | \n",
249 | " False | \n",
250 | " False | \n",
251 | " False | \n",
252 | " False | \n",
253 | " False | \n",
254 | " False | \n",
255 | " False | \n",
256 | "
\n",
257 | " \n",
258 | " 1 | \n",
259 | " False | \n",
260 | " False | \n",
261 | " False | \n",
262 | " False | \n",
263 | " False | \n",
264 | " False | \n",
265 | " False | \n",
266 | " False | \n",
267 | " False | \n",
268 | " True | \n",
269 | " False | \n",
270 | "
\n",
271 | " \n",
272 | " 2 | \n",
273 | " False | \n",
274 | " False | \n",
275 | " False | \n",
276 | " False | \n",
277 | " False | \n",
278 | " False | \n",
279 | " False | \n",
280 | " False | \n",
281 | " False | \n",
282 | " False | \n",
283 | " False | \n",
284 | "
\n",
285 | " \n",
286 | " 3 | \n",
287 | " False | \n",
288 | " False | \n",
289 | " False | \n",
290 | " False | \n",
291 | " False | \n",
292 | " False | \n",
293 | " False | \n",
294 | " False | \n",
295 | " False | \n",
296 | " True | \n",
297 | " False | \n",
298 | "
\n",
299 | " \n",
300 | " 4 | \n",
301 | " False | \n",
302 | " False | \n",
303 | " False | \n",
304 | " False | \n",
305 | " False | \n",
306 | " False | \n",
307 | " False | \n",
308 | " False | \n",
309 | " False | \n",
310 | " False | \n",
311 | " False | \n",
312 | "
\n",
313 | " \n",
314 | "
\n",
315 | "
"
316 | ],
317 | "text/plain": [
318 | " 是否幸存 仓位等级 姓名 性别 年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 \\\n",
319 | "乘客ID \n",
320 | "PassengerId False False False False False False False False False \n",
321 | "1 False False False False False False False False False \n",
322 | "2 False False False False False False False False False \n",
323 | "3 False False False False False False False False False \n",
324 | "4 False False False False False False False False False \n",
325 | "\n",
326 | " 客舱 登船港口 \n",
327 | "乘客ID \n",
328 | "PassengerId False False \n",
329 | "1 True False \n",
330 | "2 False False \n",
331 | "3 True False \n",
332 | "4 False False "
333 | ]
334 | },
335 | "execution_count": 33,
336 | "metadata": {},
337 | "output_type": "execute_result"
338 | }
339 | ],
340 | "source": []
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
 346 |     "[Summary] All the operations above are ways of observing the data itself in data analysis"
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
 353 |     "[Think] From what other angles can a dataset be observed? Look for answers; this will help a lot with the analysis below"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {},
359 | "source": [
 360 |     "### 1.3 Saving the data"
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
 367 |     "#### 1.3.1 Task 1: Save the data you loaded and modified as a new file, train_chinese.csv, in the working directory"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 18,
373 | "metadata": {},
374 | "outputs": [],
375 | "source": [
376 | "df.to_csv('train_chinese.csv')"
377 | ]
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "metadata": {},
382 | "source": [
 383 |     "[Summary] That covers loading the data and getting started. Next we move on to computation on the data itself; we will mainly master how numpy and pandas are used in work and project settings."
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": null,
389 | "metadata": {},
390 | "outputs": [],
391 | "source": []
392 | }
393 | ],
394 | "metadata": {
395 | "kernelspec": {
396 | "display_name": "Python 3",
397 | "language": "python",
398 | "name": "python3"
399 | },
400 | "language_info": {
401 | "codemirror_mode": {
402 | "name": "ipython",
403 | "version": 3
404 | },
405 | "file_extension": ".py",
406 | "mimetype": "text/x-python",
407 | "name": "python",
408 | "nbconvert_exporter": "python",
409 | "pygments_lexer": "ipython3",
410 | "version": "3.7.3"
411 | },
412 | "toc": {
413 | "base_numbering": 1,
414 | "nav_menu": {},
415 | "number_sections": false,
416 | "sideBar": true,
417 | "skip_h1_title": false,
418 | "title_cell": "Table of Contents",
419 | "title_sidebar": "Contents",
420 | "toc_cell": false,
421 | "toc_position": {
422 | "height": "calc(100% - 180px)",
423 | "left": "10px",
424 | "top": "150px",
425 | "width": "582px"
426 | },
427 | "toc_section_display": true,
428 | "toc_window_display": true
429 | }
430 | },
431 | "nbformat": 4,
432 | "nbformat_minor": 2
433 | }
434 |
--------------------------------------------------------------------------------
/第一单元项目集合/.ipynb_checkpoints/第一章:第三节探索性数据分析-课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
 7 |     "**Review:** Earlier we learned the basics of pandas and how to read csv data and create, read, update, and delete it. Today we study **exploratory data analysis**: mainly how to use pandas for sorting and arithmetic, and how to use the descriptive-statistics function describe()."
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
 14 |     "# 1 Chapter 1: Exploratory data analysis"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
 21 |     "#### Before starting, import numpy, pandas, and the data"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
 30 |     "#Load the required libraries\n",
31 | "import numpy as np\n",
32 | "import pandas as pd"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "metadata": {
39 | "scrolled": true
40 | },
41 | "outputs": [],
42 | "source": [
 43 |     "#Load the train_chinese.csv data saved earlier; we will use it for the Titanic tasks\n",
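    "# One possible answer:\n",
    "df = pd.read_csv('train_chinese.csv')\n",
    "df.head()\n"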
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
 50 |     "### 1.6 Do you know your data?\n",
 51 |     "Textbook: *Python for Data Analysis*, chapter 5"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
 58 |     "#### 1.6.1 Task 1: Use pandas to sort the sample data, in ascending order"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 3,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
 67 |     "# For details see *Python for Data Analysis*, chapter 5, the section on sorting and ranking\n",
 68 |     "\n",
 69 |     "#Build your own all-numeric DataFrame\n",
 70 |     "\n",
 71 |     "'''\n",
 72 |     "Our example:\n",
 73 |     "pd.DataFrame() : creates a DataFrame object \n",
 74 |     "np.arange(8).reshape((2, 4)) : generates a 2x4 two-dimensional array; first row: 0,1,2,3  second row: 4,5,6,7\n",
 75 |     "index=[2,1] : the row index of the DataFrame\n",
 76 |     "columns=['d', 'a', 'b', 'c'] : the column labels of the DataFrame\n",
 77 |     "'''\n",
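    "# One possible answer based on the example above:\n",
    "frame = pd.DataFrame(np.arange(8).reshape((2, 4)),\n",
    "                     index=[2, 1],\n",
    "                     columns=['d', 'a', 'b', 'c'])\n",
    "frame.sort_index()  # sort by the row index, ascending\n",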
78 | "\n"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
 85 |     "[Code walkthrough]\n",
 86 |     "\n",
 87 |     "pd.DataFrame() : creates a DataFrame object \n",
 88 |     "\n",
 89 |     "np.arange(8).reshape((2, 4)) : generates a 2x4 two-dimensional array; first row: 0,1,2,3  second row: 4,5,6,7\n",
 90 |     "\n",
 91 |     "index=[2, 1] : the row index of the DataFrame\n",
 92 |     "\n",
 93 |     "columns=['d', 'a', 'b', 'c'] : the column labels of the DataFrame"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
 100 |     "[Question]: Most of the time we want to sort by column values, so take the DataFrame you built and sort it by one of its columns, in ascending order"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 2,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
 109 |     "#Answer code\n",
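    "# One possible answer (uses the frame built above):\n",
    "frame.sort_values(by='a')\n"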
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
 116 |     "[Think] From the book, can you name the other ways pandas can sort DataFrame data?"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
 123 |     "[Summary] Below, the different ways of sorting are summarized"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
 130 |     "1. Sort the row index in ascending order"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 4,
136 | "metadata": {},
137 | "outputs": [],
138 | "source": [
 139 |     "#Code\n",
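    "# One possible answer (uses the frame built above):\n",
    "frame.sort_index()\n"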
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
 146 |     "2. Sort the column index in ascending order\n"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 4,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
 155 |     "#Code\n",
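    "# One possible answer:\n",
    "frame.sort_index(axis=1)\n"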
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
 162 |     "3. Sort the column index in descending order"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 4,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
 171 |     "#Code\n",
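    "# One possible answer:\n",
    "frame.sort_index(axis=1, ascending=False)\n"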
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
 178 |     "4. Sort by any two chosen columns at once, in descending order"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 4,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
 187 |     "#Code\n",
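    "# One possible answer:\n",
    "frame.sort_values(by=['a', 'c'], ascending=False)\n"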
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
 194 |     "#### 1.6.2 Task 2: Sort the Titanic data (train.csv) by the fare and age columns together, in descending order. What can you infer from this data?"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "scrolled": true
202 | },
203 | "outputs": [],
204 | "source": [
 205 |     "'''\n",
 206 |     "We already imported the train_chinese.csv data at the start, and we learned the loading process above, so we can sort the target columns directly\n",
 207 |     "head(20) : view the first 20 rows\n",
 208 |     "\n",
 209 |     "'''\n"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 4,
215 | "metadata": {},
216 | "outputs": [],
217 | "source": [
 218 |     "#Code\n",
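    "# One possible answer (assumes df = pd.read_csv('train_chinese.csv') above;\n",
    "# 票价 = fare, 年龄 = age):\n",
    "df.sort_values(by=['票价', '年龄'], ascending=False).head(20)\n"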
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "【思考】排序后,如果我们仅仅关注年龄和票价两列。根据常识我知道发现票价越高的应该客舱越好,所以我们会明显看出,票价前20的乘客中存活的有14人,这是相当高的一个比例,那么我们后面是不是可以进一步分析一下票价和存活之间的关系,年龄和存活之间的关系呢?当你开始发现数据之间的关系了,数据分析就开始了。\n",
226 | "\n",
227 | "当然,这只是我的想法,你还可以有更多想法,欢迎写在你的学习笔记中。"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "**多做几个数据的排序**"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 4,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "#代码\n"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": 4,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "#写下你的思考\n",
253 | "\n",
254 | "\n",
255 | "\n",
256 | "\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "#### 1.6.3 任务三:利用Pandas进行算术计算,计算两个DataFrame数据相加结果"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 10,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "# 具体请看《利用Python进行数据分析》第五章 算术运算与数据对齐 部分\n",
273 | "\n",
274 | "#自己构建两个都为数字的DataFrame数据\n",
275 | "\n",
276 | "\"\"\"\n",
277 | "我们举了一个例子:\n",
278 | "frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),\n",
279 | " columns=['a', 'b', 'c'],\n",
280 | " index=['one', 'two', 'three'])\n",
281 | "frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),\n",
282 | " columns=['a', 'e', 'c'],\n",
283 | " index=['first', 'one', 'two', 'second'])\n",
284 | "frame1_a\n",
285 | "\"\"\""
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 4,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "#代码\n"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {
300 | "scrolled": false
301 | },
302 | "source": [
303 | "将frame_a和frame_b进行相加\n"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 4,
309 | "metadata": {},
310 | "outputs": [],
311 | "source": [
312 | "#代码\n"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "【提醒】两个DataFrame相加后,会返回一个新的DataFrame,对应的行和列的值会相加,没有对应的会变成空值NaN。
\n",
320 | "当然,DataFrame还有很多算术运算,如减法,除法等,有兴趣的同学可以看《利用Python进行数据分析》第五章 算术运算与数据对齐 部分,多在网络上查找相关学习资料。"
321 | ]
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {},
326 | "source": [
327 | "#### 1.6.4 任务四:通过泰坦尼克号数据如何计算出在船上最大的家族有多少人?"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": null,
333 | "metadata": {},
334 | "outputs": [],
335 | "source": [
336 | "'''\n",
337 | "还是用之前导入的chinese_train.csv如果我们想看看在船上,最大的家族有多少人(‘兄弟姐妹个数’+‘父母子女个数’),我们该怎么做呢?\n",
338 | "'''\n"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": 4,
344 | "metadata": {},
345 | "outputs": [],
346 | "source": [
347 | "#代码\n"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "【提醒】我们只需找出”兄弟姐妹个数“和”父母子女个数“之和最大的数,当然你还可以想出很多方法和思考角度,欢迎你来说出你的看法。"
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "metadata": {},
360 | "source": [
361 | "**多做几个数据的相加,看看你能分析出什么?**"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": 4,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "#代码\n"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": 5,
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "#写下你的其他分析\n",
380 | "\n",
381 | "\n",
382 | "\n",
383 | "\n",
384 | "\n"
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "#### 1.6.5 任务五:学会使用Pandas describe()函数查看数据基本统计信息"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 13,
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "#(1) 关键知识点示例做一遍(简单数据)\n",
401 | "# 具体请看《利用Python进行数据分析》第五章 汇总和计算描述统计 部分\n",
402 | "\n",
403 | "#自己构建一个有数字有空值的DataFrame数据\n",
404 | "\n",
405 | "\n",
406 | "\"\"\"\n",
407 | "我们举了一个例子:\n",
408 | "frame2 = pd.DataFrame([[1.4, np.nan], \n",
409 | " [7.1, -4.5],\n",
410 | " [np.nan, np.nan], \n",
411 | " [0.75, -1.3]\n",
412 | " ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])\n",
413 | "frame2\n",
414 | "\n",
415 | "\"\"\""
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": 4,
421 | "metadata": {},
422 | "outputs": [],
423 | "source": [
424 | "#代码\n"
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "调用 describe 函数,观察frame2的数据基本信息"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 4,
437 | "metadata": {},
438 | "outputs": [],
439 | "source": [
440 | "#代码\n"
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "#### 1.6.6 任务六:分别看看泰坦尼克号数据集中 票价、父母子女 这列数据的基本统计数据,你能发现什么?"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": null,
453 | "metadata": {
454 | "scrolled": false
455 | },
456 | "outputs": [],
457 | "source": [
458 | "'''\n",
459 | "看看泰坦尼克号数据集中 票价 这列数据的基本统计数据\n",
460 | "'''"
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": 4,
466 | "metadata": {},
467 | "outputs": [],
468 | "source": [
469 | "#代码\n"
470 | ]
471 | },
472 | {
473 | "cell_type": "markdown",
474 | "metadata": {},
475 | "source": [
476 | "【思考】从上面数据我们可以看出,试试在下面写出你的看法。然后看看我们给出的答案。\n",
477 | "\n",
478 | "当然,答案只是我的想法,你还可以有更多想法,欢迎写在你的学习笔记中。"
479 | ]
480 | },
481 | {
482 | "cell_type": "markdown",
483 | "metadata": {},
484 | "source": [
485 | "**多做几个组数据的统计,看看你能分析出什么?**"
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": 6,
491 | "metadata": {},
492 | "outputs": [],
493 | "source": [
494 | "# 写下你的其他分析\n",
495 | "\n",
496 | "\n"
497 | ]
498 | },
499 | {
500 | "cell_type": "markdown",
501 | "metadata": {},
502 | "source": [
503 | "【思考】有更多想法,欢迎写在你的学习笔记中。"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "【总结】本节中我们通过Pandas的一些内置函数对数据进行了初步统计查看,这个过程最重要的不是大家得掌握这些函数,而是看懂从这些函数出来的数据,构建自己的数据分析思维,这也是第一章最重要的点,希望大家学完第一章能对数据有个基本认识,了解自己在做什么,为什么这么做,后面的章节我们将开始对数据进行清洗,进一步分析。"
511 | ]
512 | }
513 | ],
514 | "metadata": {
515 | "kernelspec": {
516 | "display_name": "Python 3",
517 | "language": "python",
518 | "name": "python3"
519 | },
520 | "language_info": {
521 | "codemirror_mode": {
522 | "name": "ipython",
523 | "version": 3
524 | },
525 | "file_extension": ".py",
526 | "mimetype": "text/x-python",
527 | "name": "python",
528 | "nbconvert_exporter": "python",
529 | "pygments_lexer": "ipython3",
530 | "version": "3.7.3"
531 | },
532 | "toc": {
533 | "base_numbering": 1,
534 | "nav_menu": {},
535 | "number_sections": false,
536 | "sideBar": true,
537 | "skip_h1_title": false,
538 | "title_cell": "Table of Contents",
539 | "title_sidebar": "Contents",
540 | "toc_cell": false,
541 | "toc_position": {
542 | "height": "calc(100% - 180px)",
543 | "left": "10px",
544 | "top": "150px",
545 | "width": "433px"
546 | },
547 | "toc_section_display": true,
548 | "toc_window_display": true
549 | }
550 | },
551 | "nbformat": 4,
552 | "nbformat_minor": 2
553 | }
554 |
--------------------------------------------------------------------------------
/第一单元项目集合/.ipynb_checkpoints/第一章:第二节pandas基础--课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**数据分析的第一步,加载数据我们已经学习完毕了。当数据展现在我们面前的时候,我们所要做的第一步就是认识他,今天我们要学习的就是**了解字段含义以及初步观察数据**。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## 1 第一章:数据载入及初步观察"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "### 1.4 知道你的数据叫什么\n",
22 | "教材《Python for Data Analysis》第五章"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "#### 1.4.1 任务一:pandas中有两个数据类型DateFrame和Series,通过查找简单了解他们。然后自己写一个关于这两个数据类型的小例子🌰[开放题]"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 1,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "import numpy as np\n",
39 | "import pandas as pd"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 2,
45 | "metadata": {},
46 | "outputs": [
47 | {
48 | "data": {
49 | "text/plain": [
50 | "Ohio 35000\n",
51 | "Texas 71000\n",
52 | "Oregon 16000\n",
53 | "Utah 5000\n",
54 | "dtype: int64"
55 | ]
56 | },
57 | "execution_count": 2,
58 | "metadata": {},
59 | "output_type": "execute_result"
60 | }
61 | ],
62 | "source": [
63 | "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n",
64 | "example_1 = pd.Series(sdata)\n",
65 | "example_1"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 3,
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "data": {
75 | "text/html": [
76 | "\n",
77 | "\n",
90 | "
\n",
91 | " \n",
92 | " \n",
93 | " | \n",
94 | " state | \n",
95 | " year | \n",
96 | " pop | \n",
97 | "
\n",
98 | " \n",
99 | " \n",
100 | " \n",
101 | " 0 | \n",
102 | " Ohio | \n",
103 | " 2000 | \n",
104 | " 1.5 | \n",
105 | "
\n",
106 | " \n",
107 | " 1 | \n",
108 | " Ohio | \n",
109 | " 2001 | \n",
110 | " 1.7 | \n",
111 | "
\n",
112 | " \n",
113 | " 2 | \n",
114 | " Ohio | \n",
115 | " 2002 | \n",
116 | " 3.6 | \n",
117 | "
\n",
118 | " \n",
119 | " 3 | \n",
120 | " Nevada | \n",
121 | " 2001 | \n",
122 | " 2.4 | \n",
123 | "
\n",
124 | " \n",
125 | " 4 | \n",
126 | " Nevada | \n",
127 | " 2002 | \n",
128 | " 2.9 | \n",
129 | "
\n",
130 | " \n",
131 | " 5 | \n",
132 | " Nevada | \n",
133 | " 2003 | \n",
134 | " 3.2 | \n",
135 | "
\n",
136 | " \n",
137 | "
\n",
138 | "
"
139 | ],
140 | "text/plain": [
141 | " state year pop\n",
142 | "0 Ohio 2000 1.5\n",
143 | "1 Ohio 2001 1.7\n",
144 | "2 Ohio 2002 3.6\n",
145 | "3 Nevada 2001 2.4\n",
146 | "4 Nevada 2002 2.9\n",
147 | "5 Nevada 2003 3.2"
148 | ]
149 | },
150 | "execution_count": 3,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],\n",
157 | " 'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}\n",
158 | "example_2 = pd.DataFrame(data)\n",
159 | "example_2"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "#### 1.4.2 任务二:根据上节课的方法载入\"train.csv\"文件和上一节课保存的\"train_chinese.csv\"文件\n"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": null,
172 | "metadata": {},
173 | "outputs": [],
174 | "source": []
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "我们在通过翻译版train_chinese.csv熟悉了这个数据集,然后我们对trian.csv来进行操作\n",
181 | "#### 1.4.3 任务三:查看DataFrame(trian.csv)数据的每列的项"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": []
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "#### 1.4.4任务四:查看\"cabin\"这列的所有项 [有多种方法]"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 1,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "#方法一\n"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 2,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "#方法二\n"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "经过我们的观察发现一个测试集test_1.csv有一列是多余的,我们需要将这个多余的行删去\n",
221 | "#### 1.4.5 任务五:加载文件\"test_1.csv\",然后对比\"train.csv\",看看有哪些多出的列,然后将多出的列删除"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 3,
227 | "metadata": {},
228 | "outputs": [],
229 | "source": [
230 | "# 载入’train.csv‘和’test_1.csv‘两个数据\n"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "metadata": {},
237 | "outputs": [],
238 | "source": []
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "#### 1.4.6 任务六: 将['PassengerId','Name','Age','Ticket']这几个列元素隐藏,只观察其他几个列元素"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": []
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "【思考】对比任务五和任务六,是不是使用了不一样的方法(函数),如果使用一样的函数如何完成上面的不同的要求呢?"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": 5,
264 | "metadata": {
265 | "scrolled": true
266 | },
267 | "outputs": [],
268 | "source": [
269 | "# 思考回答,写到下面\n"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "### 1.5 轴的逻辑"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "#### 1.5.1 任务一: 我们以\"Age\"为筛选条件,显示年龄在10岁以下的乘客信息。"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "metadata": {},
290 | "outputs": [],
291 | "source": []
292 | },
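{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged sketch; the variable name df is an assumption for the loaded train.csv:\n",
"# df[df['Age'] < 10]\n"
]
},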
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "#### 1.5.2 任务二: 以\"Age\"为条件,将年龄在10岁以上和50岁以下的乘客信息显示出来,并将这个数据命名为midage"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": null,
303 | "metadata": {},
304 | "outputs": [],
305 | "source": []
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "【提示】了解pandas的条件筛选方式以及如何使用交集和凝集操作"
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "#### 1.5.3 任务三:将midage的数据中第100行的\"Pclass\"和\"Sex\"的数据显示出来"
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "execution_count": null,
324 | "metadata": {},
325 | "outputs": [],
326 | "source": []
327 | },
328 | {
329 | "cell_type": "markdown",
330 | "metadata": {},
331 | "source": [
332 | "#### 1.5.4 任务四:将midage的数据中第100,105,223行的\"Pclass\",\"Name\"和\"Sex\"的数据显示出来"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": null,
338 | "metadata": {
339 | "scrolled": true
340 | },
341 | "outputs": [],
342 | "source": []
343 | },
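{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged sketch, assuming midage was re-indexed with reset_index(drop=True):\n",
"# midage.loc[[100, 105, 223], ['Pclass', 'Name', 'Sex']]\n"
]
},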
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {},
347 | "source": [
348 | "【提示】使用pandas提出的简单方式,你可以看看loc方法"
349 | ]
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "#### 1.5.5 任务五:使用iloc方法将midage的数据中第100,105,223行的\"Pclass\",\"Name\"和\"Sex\"的数据显示出来"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": null,
361 | "metadata": {},
362 | "outputs": [],
363 | "source": []
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {},
369 | "outputs": [],
370 | "source": []
371 | }
372 | ],
373 | "metadata": {
374 | "kernelspec": {
375 | "display_name": "Python 3",
376 | "language": "python",
377 | "name": "python3"
378 | },
379 | "language_info": {
380 | "codemirror_mode": {
381 | "name": "ipython",
382 | "version": 3
383 | },
384 | "file_extension": ".py",
385 | "mimetype": "text/x-python",
386 | "name": "python",
387 | "nbconvert_exporter": "python",
388 | "pygments_lexer": "ipython3",
389 | "version": "3.7.3"
390 | },
391 | "toc": {
392 | "base_numbering": 1,
393 | "nav_menu": {},
394 | "number_sections": false,
395 | "sideBar": true,
396 | "skip_h1_title": false,
397 | "title_cell": "Table of Contents",
398 | "title_sidebar": "Contents",
399 | "toc_cell": false,
400 | "toc_position": {
401 | "height": "calc(100% - 180px)",
402 | "left": "10px",
403 | "top": "150px",
404 | "width": "582px"
405 | },
406 | "toc_section_display": true,
407 | "toc_window_display": true
408 | }
409 | },
410 | "nbformat": 4,
411 | "nbformat_minor": 2
412 | }
413 |
--------------------------------------------------------------------------------
/第一单元项目集合/.ipynb_checkpoints/第一章:第二节pandas基础-课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**数据分析的第一步,加载数据我们已经学习完毕了。当数据展现在我们面前的时候,我们所要做的第一步就是认识他,今天我们要学习的就是**了解字段含义以及初步观察数据**。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## 1 第一章:数据载入及初步观察"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "### 1.4 知道你的数据叫什么\n",
22 | "我们学习pandas的基础操作,那么上一节通过pandas加载之后的数据,其数据类型是什么呢?"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "**开始前导入numpy和pandas**"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 25,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "import numpy as np\n",
39 | "import pandas as pd"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "#### 1.4.1 任务一:pandas中有两个数据类型DateFrame和Series,通过查找简单了解他们。然后自己写一个关于这两个数据类型的小例子🌰[开放题]"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 29,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "#写入代码\n"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {},
62 | "outputs": [],
63 | "source": [
64 | "'''\n",
65 | "#我们举的例子\n",
66 | "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n",
67 | "example_1 = pd.Series(sdata)\n",
68 | "example_1\n",
69 | "'''"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "'''\n",
79 | "#我们举的例子\n",
80 | "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],\n",
81 | " 'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}\n",
82 | "example_2 = pd.DataFrame(data)\n",
83 | "example_2\n",
84 | "'''\n",
85 | "\n"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "#### 1.4.2 任务二:根据上节课的方法载入\"train.csv\"文件\n"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 29,
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "#写入代码\n"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "也可以加载上一节课保存的\"train_chinese.csv\"文件。通过翻译版train_chinese.csv熟悉了这个数据集,然后我们对trian.csv来进行操作\n",
109 | "#### 1.4.3 任务三:查看DataFrame数据的每列的名称"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 29,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "#写入代码\n"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "#### 1.4.4任务四:查看\"Cabin\"这列的所有值[有多种方法]"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 29,
131 | "metadata": {},
132 | "outputs": [],
133 | "source": [
134 | "#写入代码\n"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 29,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "#写入代码\n"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "#### 1.4.5 任务五:加载文件\"test_1.csv\",然后对比\"train.csv\",看看有哪些多出的列,然后将多出的列删除\n",
151 | "经过我们的观察发现一个测试集test_1.csv有一列是多余的,我们需要将这个多余的列删去"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 29,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "#写入代码\n"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 29,
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "#写入代码\n"
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "【思考】还有其他的删除多余的列的方式吗?"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 1,
182 | "metadata": {},
183 | "outputs": [],
184 | "source": [
185 | "# 思考回答\n",
186 | "\n",
187 | "\n",
188 | "\n",
189 | "\n"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "#### 1.4.6 任务六: 将['PassengerId','Name','Age','Ticket']这几个列元素隐藏,只观察其他几个列元素"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 29,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "#写入代码\n"
206 | ]
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "【思考】对比任务五和任务六,是不是使用了不一样的方法(函数),如果使用一样的函数如何完成上面的不同的要求呢?"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "【思考回答】\n",
220 | "\n",
221 | "如果想要完全的删除你的数据结构,使用inplace=True,因为使用inplace就将原数据覆盖了,所以这里没有用"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 | "### 1.5 筛选的逻辑"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "表格数据中,最重要的一个功能就是要具有可筛选的能力,选出我所需要的信息,丢弃无用的信息。\n",
236 | "\n",
237 | "下面我们还是用实战来学习pandas这个功能。"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "#### 1.5.1 任务一: 我们以\"Age\"为筛选条件,显示年龄在10岁以下的乘客信息。"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 29,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "#写入代码\n"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "#### 1.5.2 任务二: 以\"Age\"为条件,将年龄在10岁以上和50岁以下的乘客信息显示出来,并将这个数据命名为midage"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": 29,
266 | "metadata": {},
267 | "outputs": [],
268 | "source": [
269 | "#写入代码\n"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "【提示】了解pandas的条件筛选方式以及如何使用交集和并集操作"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "#### 1.5.3 任务三:将midage的数据中第100行的\"Pclass\"和\"Sex\"的数据显示出来"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 29,
289 | "metadata": {},
290 | "outputs": [],
291 | "source": [
292 | "#写入代码\n"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "【提示】在抽取数据中,我们希望数据的相对顺序保持不变,用什么函数可以达到这个效果呢?"
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "#### 1.5.4 任务四:使用loc方法将midage的数据中第100,105,108行的\"Pclass\",\"Name\"和\"Sex\"的数据显示出来"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 29,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "#写入代码\n"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "#### 1.5.5 任务五:使用iloc方法将midage的数据中第100,105,108行的\"Pclass\",\"Name\"和\"Sex\"的数据显示出来"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 29,
328 | "metadata": {},
329 | "outputs": [],
330 | "source": [
331 | "#写入代码\n"
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "【思考】对比`iloc`和`loc`的异同"
339 | ]
340 | }
341 | ],
342 | "metadata": {
343 | "kernelspec": {
344 | "display_name": "Python 3",
345 | "language": "python",
346 | "name": "python3"
347 | },
348 | "language_info": {
349 | "codemirror_mode": {
350 | "name": "ipython",
351 | "version": 3
352 | },
353 | "file_extension": ".py",
354 | "mimetype": "text/x-python",
355 | "name": "python",
356 | "nbconvert_exporter": "python",
357 | "pygments_lexer": "ipython3",
358 | "version": "3.6.12"
359 | },
360 | "toc": {
361 | "base_numbering": 1,
362 | "nav_menu": {},
363 | "number_sections": false,
364 | "sideBar": true,
365 | "skip_h1_title": false,
366 | "title_cell": "Table of Contents",
367 | "title_sidebar": "Contents",
368 | "toc_cell": false,
369 | "toc_position": {
370 | "height": "calc(100% - 180px)",
371 | "left": "10px",
372 | "top": "150px",
373 | "width": "582px"
374 | },
375 | "toc_section_display": true,
376 | "toc_window_display": true
377 | }
378 | },
379 | "nbformat": 4,
380 | "nbformat_minor": 4
381 | }
382 |
--------------------------------------------------------------------------------
/第一单元项目集合/第一章:第一节数据载入及初步观察-课程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习**:这门课程得主要目的是通过真实的数据,以实战的方式了解数据分析的流程和熟悉数据分析python的基本操作。知道了课程的目的之后,我们接下来我们要正式的开始数据分析的实战教学,完成kaggle上[泰坦尼克的任务](https://www.kaggle.com/c/titanic/overview),实战数据分析全流程。\n",
8 | "这里有两份资料:\n",
9 | "教材《Python for Data Analysis》和 baidu.com &\n",
10 | "google.com(善用搜索引擎)"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "## 1 第一章:数据载入及初步观察"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "### 1.1 载入数据\n",
25 | "数据集下载 https://www.kaggle.com/c/titanic/overview"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "#### 1.1.1 任务一:导入numpy和pandas"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "#写入代码\n",
42 | "\n"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "【提示】如果加载失败,学会如何在你的python环境下安装numpy和pandas这两个库"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "#### 1.1.2 任务二:载入数据\n",
57 | "(1) 使用相对路径载入数据 \n",
58 | "(2) 使用绝对路径载入数据"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 1,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "#写入代码\n",
68 | "\n"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 1,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "#写入代码\n",
78 | "\n"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "【提示】相对路径载入报错时,尝试使用os.getcwd()查看当前工作目录。 \n",
86 | "【思考】知道数据加载的方法后,试试pd.read_csv()和pd.read_table()的不同,如果想让他们效果一样,需要怎么做?了解一下'.tsv'和'.csv'的不同,如何加载这两个数据集? \n",
87 | "【总结】加载的数据是所有工作的第一步,我们的工作会接触到不同的数据格式(eg:.csv;.tsv;.xlsx),但是加载的方法和思路都是一样的,在以后工作和做项目的过程中,遇到之前没有碰到的问题,要多多查资料吗,使用googel,了解业务逻辑,明白输入和输出是什么。"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "#### 1.1.3 任务三:每1000行为一个数据模块,逐块读取"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 1,
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
103 | "#写入代码\n",
104 | "\n"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "【思考】什么是逐块读取?为什么要逐块读取呢?\n",
112 | "\n",
113 | "【提示】大家可以chunker(数据块)是什么类型?用`for`循环打印出来出处具体的样子是什么?"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "#### 1.1.4 任务四:将表头改成中文,索引改为乘客ID [对于某些英文资料,我们可以通过翻译来更直观的熟悉我们的数据]\n",
121 | "PassengerId => 乘客ID \n",
122 | "Survived => 是否幸存 \n",
123 | "Pclass => 乘客等级(1/2/3等舱位) \n",
124 | "Name => 乘客姓名 \n",
125 | "Sex => 性别 \n",
126 | "Age => 年龄 \n",
127 | "SibSp => 堂兄弟/妹个数 \n",
128 | "Parch => 父母与小孩个数 \n",
129 | "Ticket => 船票信息 \n",
130 | "Fare => 票价 \n",
131 | "Cabin => 客舱 \n",
132 | "Embarked => 登船港口 "
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 1,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "#写入代码\n",
142 | "\n"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "【思考】所谓将表头改为中文其中一个思路是:将英文列名表头替换成中文。还有其他的方法吗?"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "### 1.2 初步观察\n",
157 | "导入数据后,你可能要对数据的整体结构和样例进行概览,比如说,数据大小、有多少列,各列都是什么格式的,是否包含null等"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "#### 1.2.1 任务一:查看数据的基本信息"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 1,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "#写入代码\n",
174 | "\n"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "【提示】有多个函数可以这样做,你可以做一下总结"
182 | ]
183 | },
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
188 | "#### 1.2.2 任务二:观察表格前10行的数据和后15行的数据"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 1,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": [
197 | "#写入代码\n",
198 | "\n"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": 1,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "#写入代码\n",
208 | "\n"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "#### 1.2.4 任务三:判断数据是否为空,为空的地方返回True,其余地方返回False"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 1,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "#写入代码\n",
225 | "\n"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
232 | "【总结】上面的操作都是数据分析中对于数据本身的观察"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "【思考】对于一个数据,还可以从哪些方面来观察?找找答案,这个将对下面的数据分析有很大的帮助"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
246 | "### 1.3 保存数据"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 | "#### 1.3.1 任务一:将你加载并做出改变的数据,在工作目录下保存为一个新文件train_chinese.csv"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 1,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "#写入代码\n",
263 | "# 注意:不同的操作系统保存下来可能会有乱码。大家可以加入`encoding='GBK' 或者 ’encoding = ’utf-8‘‘`\n"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "【总结】数据的加载以及入门,接下来就要接触数据本身的运算,我们将主要掌握numpy和pandas在工作和项目场景的运用。"
271 | ]
272 | }
273 | ],
274 | "metadata": {
275 | "kernelspec": {
276 | "display_name": "Python 3",
277 | "language": "python",
278 | "name": "python3"
279 | },
280 | "language_info": {
281 | "codemirror_mode": {
282 | "name": "ipython",
283 | "version": 3
284 | },
285 | "file_extension": ".py",
286 | "mimetype": "text/x-python",
287 | "name": "python",
288 | "nbconvert_exporter": "python",
289 | "pygments_lexer": "ipython3",
290 | "version": "3.8.10"
291 | },
292 | "toc": {
293 | "base_numbering": 1,
294 | "nav_menu": {},
295 | "number_sections": false,
296 | "sideBar": true,
297 | "skip_h1_title": false,
298 | "title_cell": "Table of Contents",
299 | "title_sidebar": "Contents",
300 | "toc_cell": false,
301 | "toc_position": {
302 | "height": "calc(100% - 180px)",
303 | "left": "10px",
304 | "top": "150px",
305 | "width": "582px"
306 | },
307 | "toc_section_display": true,
308 | "toc_window_display": true
309 | }
310 | },
311 | "nbformat": 4,
312 | "nbformat_minor": 4
313 | }
314 |
--------------------------------------------------------------------------------
/第一单元项目集合/第一章:第三节探索性数据分析-课程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**在前面我们已经学习了Pandas基础,知道利用Pandas读取csv数据的增删查改,今天我们要学习的就是**探索性数据分析**,主要介绍如何利用Pandas进行排序、算术计算以及计算描述函数describe()的使用。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# 1 第一章:探索性数据分析"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "#### 开始之前,导入numpy、pandas包和数据"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "#加载所需的库\n",
31 | "import numpy as np\n",
32 | "import pandas as pd"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "metadata": {
39 | "scrolled": true
40 | },
41 | "outputs": [],
42 | "source": [
43 | "#载入之前保存的train_chinese.csv数据,关于泰坦尼克号的任务,我们就使用这个数据\n"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "### 1.6 了解你的数据吗?\n",
51 | "教材《Python for Data Analysis》第五章"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "#### 1.6.1 任务一:利用Pandas对示例数据进行排序,要求升序"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 3,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "# 具体请看《利用Python进行数据分析》第五章 排序和排名 部分\n",
68 | "\n",
69 | "#自己构建一个都为数字的DataFrame数据\n",
70 | "\n",
71 | "'''\n",
72 | "我们举了一个例子\n",
73 | "pd.DataFrame() :创建一个DataFrame对象 \n",
74 | "np.arange(8).reshape((2, 4)) : 生成一个二维数组(2*4),第一列:0,1,2,3 第二列:4,5,6,7\n",
75 | "index=[2,1] :DataFrame 对象的索引列\n",
76 | "columns=['d', 'a', 'b', 'c'] :DataFrame 对象的索引行\n",
77 | "'''\n",
78 | "\n"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "【代码解析】\n",
86 | "\n",
87 | "pd.DataFrame() :创建一个DataFrame对象 \n",
88 | "\n",
89 | "np.arange(8).reshape((2, 4)) : 生成一个二维数组(2*4),第一列:0,1,2,3 第二列:4,5,6,7\n",
90 | "\n",
91 | "index=['2, 1] :DataFrame 对象的索引列\n",
92 | "\n",
93 | "columns=['d', 'a', 'b', 'c'] :DataFrame 对象的索引行"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "【问题】:大多数时候我们都是想根据列的值来排序,所以将你构建的DataFrame中的数据根据某一列,升序排列"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 2,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "#回答代码\n"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "【思考】通过书本你能说出Pandas对DataFrame数据的其他排序方式吗?"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "【总结】下面将不同的排序方式做一个总结"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "1.让行索引升序排序"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 4,
136 | "metadata": {},
137 | "outputs": [],
138 | "source": [
139 | "#代码\n"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "2.让列索引升序排序\n"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 4,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "#代码\n"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "3.让列索引降序排序"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 4,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "#代码\n"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "4.让任选两列数据同时降序排序"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 4,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
187 | "#代码\n"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "#### 1.6.2 任务二:对泰坦尼克号数据(trian.csv)按票价和年龄两列进行综合排序(降序排列),从这个数据中你可以分析出什么?"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "scrolled": true
202 | },
203 | "outputs": [],
204 | "source": [
205 | "'''\n",
206 | "在开始我们已经导入了train_chinese.csv数据,而且前面我们也学习了导入数据过程,根据上面学习,我们直接对目标列进行排序即可\n",
207 | "head(20) : 读取前20条数据\n",
208 | "\n",
209 | "'''\n"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 4,
215 | "metadata": {},
216 | "outputs": [],
217 | "source": [
218 | "#代码\n"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "【思考】排序后,如果我们仅仅关注年龄和票价两列。根据常识我知道发现票价越高的应该客舱越好,所以我们会明显看出,票价前20的乘客中存活的有14人,这是相当高的一个比例,那么我们后面是不是可以进一步分析一下票价和存活之间的关系,年龄和存活之间的关系呢?当你开始发现数据之间的关系了,数据分析就开始了。\n",
226 | "\n",
227 | "当然,这只是我的想法,你还可以有更多想法,欢迎写在你的学习笔记中。"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "**多做几个数据的排序**"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 4,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "#代码\n"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": 4,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "#写下你的思考\n",
253 | "\n",
254 | "\n",
255 | "\n",
256 | "\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "#### 1.6.3 任务三:利用Pandas进行算术计算,计算两个DataFrame数据相加结果"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 10,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "# 具体请看《利用Python进行数据分析》第五章 算术运算与数据对齐 部分\n",
273 | "\n",
274 | "#自己构建两个都为数字的DataFrame数据\n",
275 | "\n",
276 | "\"\"\"\n",
277 | "我们举了一个例子:\n",
278 | "frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),\n",
279 | " columns=['a', 'b', 'c'],\n",
280 | " index=['one', 'two', 'three'])\n",
281 | "frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),\n",
282 | " columns=['a', 'e', 'c'],\n",
283 | " index=['first', 'one', 'two', 'second'])\n",
284 | "frame1_a\n",
285 | "\"\"\""
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 4,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "#代码\n"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "将frame_a和frame_b进行相加\n"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": 4,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "#代码\n"
311 | ]
312 | },
313 | {
314 | "cell_type": "markdown",
315 | "metadata": {},
316 | "source": [
317 | "【提醒】两个DataFrame相加后,会返回一个新的DataFrame,对应的行和列的值会相加,没有对应的会变成空值NaN。
\n",
318 | "当然,DataFrame还有很多算术运算,如减法,除法等,有兴趣的同学可以看《利用Python进行数据分析》第五章 算术运算与数据对齐 部分,多在网络上查找相关学习资料。"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "#### 1.6.4 任务四:通过泰坦尼克号数据如何计算出在船上最大的家族有多少人?"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {},
332 | "outputs": [],
333 | "source": [
334 | "'''\n",
335 | "还是用之前导入的chinese_train.csv如果我们想看看在船上,最大的家族有多少人(‘兄弟姐妹个数’+‘父母子女个数’),我们该怎么做呢?\n",
336 | "'''\n"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 4,
342 | "metadata": {},
343 | "outputs": [],
344 | "source": [
345 | "#代码\n"
346 | ]
347 | },
348 | {
349 | "cell_type": "markdown",
350 | "metadata": {},
351 | "source": [
352 | "【提醒】我们只需找出”兄弟姐妹个数“和”父母子女个数“之和最大的数,当然你还可以想出很多方法和思考角度,欢迎你来说出你的看法。"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "**多做几个数据的相加,看看你能分析出什么?**"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": 4,
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "#代码\n"
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "execution_count": 5,
374 | "metadata": {},
375 | "outputs": [],
376 | "source": [
377 | "#写下你的其他分析\n",
378 | "\n",
379 | "\n",
380 | "\n",
381 | "\n",
382 | "\n"
383 | ]
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {},
388 | "source": [
389 | "#### 1.6.5 任务五:学会使用Pandas describe()函数查看数据基本统计信息"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 13,
395 | "metadata": {},
396 | "outputs": [],
397 | "source": [
398 | "#(1) 关键知识点示例做一遍(简单数据)\n",
399 | "# 具体请看《利用Python进行数据分析》第五章 汇总和计算描述统计 部分\n",
400 | "\n",
401 | "#自己构建一个有数字有空值的DataFrame数据\n",
402 | "\n",
403 | "\n",
404 | "\"\"\"\n",
405 | "我们举了一个例子:\n",
406 | "frame2 = pd.DataFrame([[1.4, np.nan], \n",
407 | " [7.1, -4.5],\n",
408 | " [np.nan, np.nan], \n",
409 | " [0.75, -1.3]\n",
410 | " ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])\n",
411 | "frame2\n",
412 | "\n",
413 | "\"\"\""
414 | ]
415 | },
416 | {
417 | "cell_type": "code",
418 | "execution_count": 4,
419 | "metadata": {},
420 | "outputs": [],
421 | "source": [
422 | "#代码\n"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "调用 describe 函数,观察frame2的数据基本信息"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": 4,
435 | "metadata": {},
436 | "outputs": [],
437 | "source": [
438 | "#代码\n"
439 | ]
440 | },
441 | {
442 | "cell_type": "markdown",
443 | "metadata": {},
444 | "source": [
445 | "#### 1.6.6 任务六:分别看看泰坦尼克号数据集中 票价、父母子女 这列数据的基本统计数据,你能发现什么?"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": null,
451 | "metadata": {},
452 | "outputs": [],
453 | "source": [
454 | "'''\n",
455 | "看看泰坦尼克号数据集中 票价 这列数据的基本统计数据\n",
456 | "'''"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": 4,
462 | "metadata": {},
463 | "outputs": [],
464 | "source": [
465 | "#代码\n"
466 | ]
467 | },
468 | {
469 | "cell_type": "markdown",
470 | "metadata": {},
471 | "source": [
472 | "【思考】从上面数据我们可以看出,试试在下面写出你的看法。然后看看我们给出的答案。\n",
473 | "\n",
474 | "当然,答案只是我的想法,你还可以有更多想法,欢迎写在你的学习笔记中。"
475 | ]
476 | },
477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "**多做几个组数据的统计,看看你能分析出什么?**"
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": 6,
487 | "metadata": {},
488 | "outputs": [],
489 | "source": [
490 | "# 写下你的其他分析\n",
491 | "\n",
492 | "\n"
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "【思考】有更多想法,欢迎写在你的学习笔记中。"
500 | ]
501 | },
502 | {
503 | "cell_type": "markdown",
504 | "metadata": {},
505 | "source": [
506 | "【总结】本节中我们通过Pandas的一些内置函数对数据进行了初步统计查看,这个过程最重要的不是大家得掌握这些函数,而是看懂从这些函数出来的数据,构建自己的数据分析思维,这也是第一章最重要的点,希望大家学完第一章能对数据有个基本认识,了解自己在做什么,为什么这么做,后面的章节我们将开始对数据进行清洗,进一步分析。"
507 | ]
508 | }
509 | ],
510 | "metadata": {
511 | "kernelspec": {
512 | "display_name": "Python 3",
513 | "language": "python",
514 | "name": "python3"
515 | },
516 | "language_info": {
517 | "codemirror_mode": {
518 | "name": "ipython",
519 | "version": 3
520 | },
521 | "file_extension": ".py",
522 | "mimetype": "text/x-python",
523 | "name": "python",
524 | "nbconvert_exporter": "python",
525 | "pygments_lexer": "ipython3",
526 | "version": "3.6.12"
527 | },
528 | "toc": {
529 | "base_numbering": 1,
530 | "nav_menu": {},
531 | "number_sections": false,
532 | "sideBar": true,
533 | "skip_h1_title": false,
534 | "title_cell": "Table of Contents",
535 | "title_sidebar": "Contents",
536 | "toc_cell": false,
537 | "toc_position": {
538 | "height": "calc(100% - 180px)",
539 | "left": "10px",
540 | "top": "150px",
541 | "width": "433px"
542 | },
543 | "toc_section_display": true,
544 | "toc_window_display": true
545 | }
546 | },
547 | "nbformat": 4,
548 | "nbformat_minor": 4
549 | }
550 |
--------------------------------------------------------------------------------
/第一单元项目集合/第一章:第二节pandas基础-课程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**数据分析的第一步,加载数据我们已经学习完毕了。当数据展现在我们面前的时候,我们所要做的第一步就是认识他,今天我们要学习的就是**了解字段含义以及初步观察数据**。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## 1 第一章:数据载入及初步观察"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "### 1.4 知道你的数据叫什么\n",
22 | "我们学习pandas的基础操作,那么上一节通过pandas加载之后的数据,其数据类型是什么呢?"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "**开始前导入numpy和pandas**"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 25,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "import numpy as np\n",
39 | "import pandas as pd"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "#### 1.4.1 任务一:pandas中有两个数据类型DateFrame和Series,通过查找简单了解他们。然后自己写一个关于这两个数据类型的小例子🌰[开放题]"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 29,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "#写入代码\n"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {},
62 | "outputs": [],
63 | "source": [
64 | "'''\n",
65 | "#我们举的例子\n",
66 | "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n",
67 | "example_1 = pd.Series(sdata)\n",
68 | "example_1\n",
69 | "'''"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "'''\n",
79 | "#我们举的例子\n",
80 | "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],\n",
81 | " 'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}\n",
82 | "example_2 = pd.DataFrame(data)\n",
83 | "example_2\n",
84 | "'''\n",
85 | "\n"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "#### 1.4.2 任务二:根据上节课的方法载入\"train.csv\"文件\n"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 29,
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "#写入代码\n"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "也可以加载上一节课保存的\"train_chinese.csv\"文件。通过翻译版train_chinese.csv熟悉了这个数据集,然后我们对trian.csv来进行操作\n",
109 | "#### 1.4.3 任务三:查看DataFrame数据的每列的名称"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 29,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "#写入代码\n"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "#### 1.4.4任务四:查看\"Cabin\"这列的所有值[有多种方法]"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 29,
131 | "metadata": {},
132 | "outputs": [],
133 | "source": [
134 | "#写入代码\n"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 29,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "#写入代码\n"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "#### 1.4.5 任务五:加载文件\"test_1.csv\",然后对比\"train.csv\",看看有哪些多出的列,然后将多出的列删除\n",
151 | "经过我们的观察发现一个测试集test_1.csv有一列是多余的,我们需要将这个多余的列删去"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 29,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "#写入代码\n"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 29,
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "#写入代码\n"
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "【思考】还有其他的删除多余的列的方式吗?"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 1,
182 | "metadata": {},
183 | "outputs": [],
184 | "source": [
185 | "# 思考回答\n",
186 | "\n",
187 | "\n",
188 | "\n",
189 | "\n"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "#### 1.4.6 任务六: 将['PassengerId','Name','Age','Ticket']这几个列元素隐藏,只观察其他几个列元素"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 29,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "#写入代码\n"
206 | ]
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "【思考】对比任务五和任务六,是不是使用了不一样的方法(函数),如果使用一样的函数如何完成上面的不同的要求呢?"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "【思考回答】\n",
220 | "\n",
221 | "如果想要完全的删除你的数据结构,使用inplace=True,因为使用inplace就将原数据覆盖了,所以这里没有用"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 | "### 1.5 筛选的逻辑"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "表格数据中,最重要的一个功能就是要具有可筛选的能力,选出我所需要的信息,丢弃无用的信息。\n",
236 | "\n",
237 | "下面我们还是用实战来学习pandas这个功能。"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "#### 1.5.1 任务一: 我们以\"Age\"为筛选条件,显示年龄在10岁以下的乘客信息。"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 29,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "#写入代码\n"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "#### 1.5.2 任务二: 以\"Age\"为条件,将年龄在10岁以上和50岁以下的乘客信息显示出来,并将这个数据命名为midage"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": 29,
266 | "metadata": {},
267 | "outputs": [],
268 | "source": [
269 | "#写入代码\n"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "【提示】了解pandas的条件筛选方式以及如何使用交集和并集操作"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "#### 1.5.3 任务三:将midage的数据中第100行的\"Pclass\"和\"Sex\"的数据显示出来"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 29,
289 | "metadata": {},
290 | "outputs": [],
291 | "source": [
292 | "#写入代码\n"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "【提示】在抽取数据中,我们希望数据的相对顺序保持不变,用什么函数可以达到这个效果呢?"
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "#### 1.5.4 任务四:使用loc方法将midage的数据中第100,105,108行的\"Pclass\",\"Name\"和\"Sex\"的数据显示出来"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 29,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "#写入代码\n"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "#### 1.5.5 任务五:使用iloc方法将midage的数据中第100,105,108行的\"Pclass\",\"Name\"和\"Sex\"的数据显示出来"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 29,
328 | "metadata": {},
329 | "outputs": [],
330 | "source": [
331 | "#写入代码\n"
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "【思考】对比`iloc`和`loc`的异同"
339 | ]
340 | }
341 | ],
342 | "metadata": {
343 | "kernelspec": {
344 | "display_name": "Python 3",
345 | "language": "python",
346 | "name": "python3"
347 | },
348 | "language_info": {
349 | "codemirror_mode": {
350 | "name": "ipython",
351 | "version": 3
352 | },
353 | "file_extension": ".py",
354 | "mimetype": "text/x-python",
355 | "name": "python",
356 | "nbconvert_exporter": "python",
357 | "pygments_lexer": "ipython3",
358 | "version": "3.8.10"
359 | },
360 | "toc": {
361 | "base_numbering": 1,
362 | "nav_menu": {},
363 | "number_sections": false,
364 | "sideBar": true,
365 | "skip_h1_title": false,
366 | "title_cell": "Table of Contents",
367 | "title_sidebar": "Contents",
368 | "toc_cell": false,
369 | "toc_position": {
370 | "height": "calc(100% - 180px)",
371 | "left": "10px",
372 | "top": "150px",
373 | "width": "582px"
374 | },
375 | "toc_section_display": true,
376 | "toc_window_display": true
377 | }
378 | },
379 | "nbformat": 4,
380 | "nbformat_minor": 4
381 | }
382 |
--------------------------------------------------------------------------------
/第三章项目集合/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/.DS_Store
--------------------------------------------------------------------------------
/第三章项目集合/.ipynb_checkpoints/第三章模型建立和评估---评价--课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## 第三章 模型搭建和评估"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "经过前面的探索性数据分析我们可以很清楚的了解到数据集的情况"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 18,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "import pandas as pd\n",
24 | "import numpy as np\n",
25 | "import seaborn as sns\n",
26 | "import matplotlib.pyplot as plt\n",
27 | "from IPython.display import Image\n",
28 | "from sklearn.linear_model import LogisticRegression\n",
29 | "from sklearn.ensemble import RandomForestClassifier"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 19,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "%matplotlib inline"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 20,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签\n",
48 | "plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号\n",
49 | "plt.rcParams['figure.figsize'] = (10, 6) # 设置输出图片大小"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "**重新分割测试集和训练集**"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 21,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "from sklearn.model_selection import train_test_split"
66 | ]
67 | },
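{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged sketch of the re-split; the file and variable names are assumptions based on this chapter's files:\n",
"# X = pd.read_csv('clear_data.csv')\n",
"# y = pd.read_csv('train.csv')['Survived']\n",
"# X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n"
]
},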
68 | {
69 | "cell_type": "code",
70 | "execution_count": 23,
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "data": {
75 | "text/plain": [
76 | "((668, 12), (667, 2))"
77 | ]
78 | },
79 | "execution_count": 23,
80 | "metadata": {},
81 | "output_type": "execute_result"
82 | }
83 | ],
84 | "source": [
85 | "X_train.shape, y_train.shape\n"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 15,
91 | "metadata": {},
92 | "outputs": [
93 | {
94 | "name": "stderr",
95 | "output_type": "stream",
96 | "text": [
97 | "/Users/chenandong/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
98 | " FutureWarning)\n"
99 | ]
100 | },
101 | {
102 | "ename": "ValueError",
103 | "evalue": "bad input shape (667, 2)",
104 | "output_type": "error",
105 | "traceback": [
106 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
107 | "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
108 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# 默认参数逻辑回归模型\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mlr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mLogisticRegression\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mlr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
109 | "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 1530\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1531\u001b[0m X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order=\"C\",\n\u001b[0;32m-> 1532\u001b[0;31m accept_large_sparse=solver != 'liblinear')\n\u001b[0m\u001b[1;32m 1533\u001b[0m \u001b[0mcheck_classification_targets\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1534\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclasses_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
110 | "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_X_y\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 722\u001b[0m dtype=None)\n\u001b[1;32m 723\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 724\u001b[0;31m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcolumn_or_1d\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwarn\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 725\u001b[0m \u001b[0m_assert_all_finite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 726\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0my_numeric\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkind\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'O'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
111 | "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcolumn_or_1d\u001b[0;34m(y, warn)\u001b[0m\n\u001b[1;32m 758\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mravel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 759\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 760\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"bad input shape {0}\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 761\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 762\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
112 | "\u001b[0;31mValueError\u001b[0m: bad input shape (667, 2)"
113 | ]
114 | }
115 | ],
116 | "source": [
117 | "# 默认参数逻辑回归模型\n",
118 | "lr = LogisticRegression()\n",
119 | "lr.fit(X_train, y_train)"
120 | ]
121 | },
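{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ValueError: bad input shape (667, 2)` above says that `fit` received a two-column `y_train`, while sklearn expects a 1-D label vector. A minimal sketch of the likely fix (assuming the label column is named `Survived`; use the actual column name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep only the 1-D label column before fitting ('Survived' is an assumption)\n",
"y_train = y_train['Survived']\n",
"lr = LogisticRegression()\n",
"lr.fit(X_train, y_train)"
]
},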
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "### 模型评估"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "* 模型评估是为了知道模型的泛化能力。\n",
134 | "* 交叉验证(cross-validation)是一种评估泛化性能的统计学方法,它比单次划分训练集和测试集的方法更加稳定、全面。\n",
135 | "* 在交叉验证中,数据被多次划分,并且需要训练多个模型。\n",
136 | "* 最常用的交叉验证是 k 折交叉验证(k-fold cross-validation),其中 k 是由用户指定的数字,通常取 5 或 10。\n",
137 | "* 准确率(precision)度量的是被预测为正例的样本中有多少是真正的正例\n",
138 | "* 召回率(recall)度量的是正类样本中有多少被预测为正类\n",
139 | "* f-分数是准确率与召回率的调和平均"
140 | ]
141 | },
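{
"cell_type": "markdown",
"metadata": {},
"source": [
"Written as formulas, with TP, FP and FN read off the confusion matrix:\n",
"\n",
"$$\\text{precision} = \\frac{TP}{TP + FP}, \\qquad \\text{recall} = \\frac{TP}{TP + FN}, \\qquad F_1 = \\frac{2 \\cdot \\text{precision} \\cdot \\text{recall}}{\\text{precision} + \\text{recall}}$$"
]
},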
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "#### 任务一:交叉验证\n",
147 | "* 用10折交叉验证来评估逻辑回归模型\n",
148 | "* 计算交叉验证精度的平均值"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "Image('Snipaste_2020-01-05_16-37-56.png')"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "#### 提示4\n",
165 | "* 交叉验证在sklearn中的模块为`sklearn.model_selection`"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "from sklearn.model_selection import cross_val_score"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {},
181 | "outputs": [],
182 | "source": [
183 | "lr = LogisticRegression(C=100)\n",
184 | "scores = cross_val_score(lr, X_train, y_train, cv=10)"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "# k折交叉验证分数\n",
194 | "scores"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "scrolled": true
202 | },
203 | "outputs": [],
204 | "source": [
205 | "# 平均交叉验证分数\n",
206 | "print(\"Average cross-validation score: {:.2f}\".format(scores.mean()))"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "#### 思考4\n",
214 | "* k折越多的情况下会带来什么样的影响?"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "#### 任务二:混淆矩阵\n",
222 | "* 计算二分类问题的混淆矩阵\n",
223 | "* 计算精确率、召回率以及f-分数"
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "metadata": {},
230 | "outputs": [],
231 | "source": [
232 | "Image('Snipaste_2020-01-05_16-38-26.png')"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": null,
238 | "metadata": {},
239 | "outputs": [],
240 | "source": [
241 | "Image('Snipaste_2020-01-05_16-39-27.png')"
242 | ]
243 | },
244 | {
245 | "cell_type": "markdown",
246 | "metadata": {},
247 | "source": [
248 | "#### 提示5\n",
249 | "* 混淆矩阵的方法在sklearn中的`sklearn.metrics`模块\n",
250 | "* 混淆矩阵需要输入真实标签和预测标签"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {},
257 | "outputs": [],
258 | "source": [
259 | "from sklearn.metrics import confusion_matrix"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": null,
265 | "metadata": {},
266 | "outputs": [],
267 | "source": [
268 | "# 训练模型\n",
269 | "lr = LogisticRegression(C=100)\n",
270 | "lr.fit(X_train, y_train)"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "metadata": {},
277 | "outputs": [],
278 | "source": [
279 | "# 模型预测结果\n",
280 | "pred = lr.predict(X_train)"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {},
287 | "outputs": [],
288 | "source": [
289 | "# 混淆矩阵\n",
290 | "confusion_matrix(y_train, pred)"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": null,
296 | "metadata": {},
297 | "outputs": [],
298 | "source": [
299 | "from sklearn.metrics import classification_report"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "metadata": {},
306 | "outputs": [],
307 | "source": [
308 | "# 精确率、召回率以及f1-score\n",
309 | "print(classification_report(y_train, pred))"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "metadata": {},
316 | "outputs": [],
317 | "source": []
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "#### 思考5\n",
324 | "* 如果自己实现混淆矩阵的时候该注意什么问题"
325 | ]
326 | },
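{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of a hand-rolled confusion matrix (reusing `y_train` and `pred` from the cells above); the main thing to watch is fixing an explicit, consistent label order for the rows and columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def my_confusion_matrix(y_true, y_pred):\n",
"    labels = np.unique(y_true)  # fix the row/column order explicitly\n",
"    m = np.zeros((len(labels), len(labels)), dtype=int)\n",
"    for i, t in enumerate(labels):\n",
"        for j, p in enumerate(labels):\n",
"            m[i, j] = np.sum((y_true == t) & (y_pred == p))\n",
"    return m\n",
"\n",
"my_confusion_matrix(np.asarray(y_train), pred)"
]
},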
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {},
330 | "source": [
331 | "#### 任务三:ROC曲线\n",
332 | "* 绘制ROC曲线"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "#### 提示6\n",
340 | "* ROC曲线在sklearn中的模块为`sklearn.metrics`\n",
341 | "* ROC曲线下面所包围的面积越大越好"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {},
348 | "outputs": [],
349 | "source": [
350 | "from sklearn.metrics import roc_curve"
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": null,
356 | "metadata": {
357 | "scrolled": true
358 | },
359 | "outputs": [],
360 | "source": [
361 | "fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))\n",
362 | "plt.plot(fpr, tpr, label=\"ROC Curve\")\n",
363 | "plt.xlabel(\"FPR\")\n",
364 | "plt.ylabel(\"TPR (recall)\")\n",
365 | "# 找到最接近于0的阈值\n",
366 | "close_zero = np.argmin(np.abs(thresholds))\n",
367 | "plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label=\"threshold zero\", fillstyle=\"none\", c='k', mew=2)\n",
368 | "plt.legend(loc=4)"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "#### 思考6\n",
376 | "* 对于多分类问题如何绘制ROC曲线"
377 | ]
378 | },
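{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of one common answer, one-vs-rest curves (not tied to this dataset; `y_multi` and the (n_samples, 3) score matrix `y_score` are hypothetical):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import auc\n",
"from sklearn.preprocessing import label_binarize\n",
"\n",
"y_bin = label_binarize(y_multi, classes=[0, 1, 2])  # one column per class\n",
"for i in range(3):\n",
"    fpr_i, tpr_i, _ = roc_curve(y_bin[:, i], y_score[:, i])\n",
"    plt.plot(fpr_i, tpr_i, label='class %d (AUC = %.2f)' % (i, auc(fpr_i, tpr_i)))\n",
"plt.legend(loc=4)"
]
},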
379 | {
380 | "cell_type": "code",
381 | "execution_count": null,
382 | "metadata": {},
383 | "outputs": [],
384 | "source": []
385 | }
386 | ],
387 | "metadata": {
388 | "kernelspec": {
389 | "display_name": "Python 3",
390 | "language": "python",
391 | "name": "python3"
392 | },
393 | "language_info": {
394 | "codemirror_mode": {
395 | "name": "ipython",
396 | "version": 3
397 | },
398 | "file_extension": ".py",
399 | "mimetype": "text/x-python",
400 | "name": "python",
401 | "nbconvert_exporter": "python",
402 | "pygments_lexer": "ipython3",
403 | "version": "3.7.3"
404 | },
405 | "toc": {
406 | "base_numbering": 1,
407 | "nav_menu": {},
408 | "number_sections": true,
409 | "sideBar": true,
410 | "skip_h1_title": false,
411 | "title_cell": "Table of Contents",
412 | "title_sidebar": "Contents",
413 | "toc_cell": false,
414 | "toc_position": {
415 | "height": "calc(100% - 180px)",
416 | "left": "10px",
417 | "top": "150px",
418 | "width": "384px"
419 | },
420 | "toc_section_display": true,
421 | "toc_window_display": true
422 | }
423 | },
424 | "nbformat": 4,
425 | "nbformat_minor": 2
426 | }
427 |
--------------------------------------------------------------------------------
/第三章项目集合/1578213400(1).jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/1578213400(1).jpg
--------------------------------------------------------------------------------
/第三章项目集合/20170624105439491.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/20170624105439491.png
--------------------------------------------------------------------------------
/第三章项目集合/Snipaste_2020-01-05_16-37-56.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/Snipaste_2020-01-05_16-37-56.png
--------------------------------------------------------------------------------
/第三章项目集合/Snipaste_2020-01-05_16-38-26.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/Snipaste_2020-01-05_16-38-26.png
--------------------------------------------------------------------------------
/第三章项目集合/Snipaste_2020-01-05_16-39-27.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/Snipaste_2020-01-05_16-39-27.png
--------------------------------------------------------------------------------
/第三章项目集合/sklearn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/sklearn.png
--------------------------------------------------------------------------------
/第三章项目集合/sklearn01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第三章项目集合/sklearn01.png
--------------------------------------------------------------------------------
/第三章项目集合/大纲.md:
--------------------------------------------------------------------------------
1 | ## Load the data
2 | 
3 | ## Exploratory data analysis
4 | 
5 | ### Value distributions
6 | 
7 | ### Missing values
8 | 
9 | ### Outliers
10 | 
11 | ### Correlations
12 | 
13 | 
14 | 
15 | ## Feature engineering
16 | 
17 | ### Sparse features
18 | 
19 | ### Dense features
20 | 
21 | ## Model training
22 | 
23 | ### Model building
24 | 
25 | ### Model evaluation
26 | 
27 | ### Model ensembling
28 |
29 |
30 |
31 |
32 |
33 |
--------------------------------------------------------------------------------
/第二章项目集合/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第二章项目集合/.DS_Store
--------------------------------------------------------------------------------
/第二章项目集合/.ipynb_checkpoints/第二章:第一节数据清洗及特征处理-课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "【回顾&引言】前面一章的内容大家可以感觉到我们主要是对基础知识做一个梳理,让大家了解数据分析的一些操作,主要做了数据的各个角度的观察。那么在这里,我们主要是做数据分析的流程性学习,主要是包括了数据清洗以及数据的特征处理,数据重构以及数据可视化。这些内容是为数据分析最后的建模和模型评价做一个铺垫。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "#### 开始之前,导入numpy、pandas包和数据"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "#加载所需的库\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "#加载数据train.csv\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "## 2 第二章:数据清洗及特征处理\n",
40 | "我们拿到的数据通常是不干净的,所谓的不干净,就是数据中有缺失值,有一些异常点等,需要经过一定的处理才能继续做后面的分析或建模,所以拿到数据的第一步是进行数据清洗,本章我们将学习缺失值、重复值、字符串和数据转换等操作,将数据清洗成可以分析或建模的亚子。"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "### 2.1 缺失值观察与处理\n",
48 | "我们拿到的数据经常会有很多缺失值,比如我们可以看到Cabin列存在NaN,那其他列还有没有缺失值,这些缺失值要怎么处理呢"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "#### 2.1.1 任务一:缺失值观察\n",
56 | "(1) 请查看每个特征缺失值个数 \n",
57 | "(2) 请查看Age, Cabin, Embarked列的数据\n",
58 | "以上方式都有多种方式,所以大家多多益善"
59 | ]
60 | },
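{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one possible answer (assuming the data was loaded into `df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.isnull().sum()  # missing-value count per feature\n",
"# df[['Age', 'Cabin', 'Embarked']].head()  # inspect the three columns in question"
]
},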
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "#写入代码\n",
68 | "\n",
69 | "\n"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "#写入代码\n",
79 | "\n",
80 | "\n"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "#写入代码\n",
90 | "\n",
91 | "\n"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "#### 2.1.2 任务二:对缺失值进行处理\n",
99 | "(1)处理缺失值一般有几种思路\n",
100 | "\n",
101 | "(2) 请尝试对Age列的数据的缺失值进行处理\n",
102 | "\n",
103 | "(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理 \n"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "#处理缺失值的一般思路:\n",
113 | "#提醒:可使用的函数有--->dropna函数与fillna函数\n",
114 | "\n",
115 | "\n"
116 | ]
117 | },
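{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the usual options (assuming the data is in `df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill Age with the column mean\n",
"# df = df.dropna()   # or drop every row containing a NaN\n",
"# df = df.fillna(0)  # or fill the whole table with a constant"
]
},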
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "metadata": {},
122 | "outputs": [],
123 | "source": [
124 | "#写入代码\n",
125 | "\n",
126 | "\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {},
133 | "outputs": [],
134 | "source": [
135 | "#写入代码\n",
136 | "\n",
137 | "\n"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "#写入代码\n",
147 | "\n",
148 | "\n"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "【思考1】dropna和fillna有哪些参数,分别如何使用呢? "
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "【思考】检索空缺值用`np.nan`,`None`以及`.isnull()`哪个更好,这是为什么?如果其中某个方式无法找到缺失值,原因又是为什么?"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 1,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "#思考回答\n",
172 | "\n",
173 | "\n"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "### 2.2 重复值观察与处理\n",
195 | "由于这样那样的原因,数据中会不会存在重复值呢,如果存在要怎样处理呢"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "#### 2.2.1 任务一:请查看数据中的重复值"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {},
209 | "outputs": [],
210 | "source": [
211 | "#写入代码\n",
212 | "\n",
213 | "\n"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "#### 2.2.2 任务二:对重复值进行处理\n",
221 | "(1)重复值有哪些处理方式呢?\n",
222 | "\n",
223 | "(2)处理我们数据的重复值\n",
224 | "\n",
225 | "方法多多益善"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "#重复值有哪些处理方式:\n",
235 | "\n",
236 | "\n"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "#写入代码\n",
246 | "\n",
247 | "\n"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "#### 2.2.3 任务三:将前面清洗的数据保存为csv格式"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "#写入代码\n",
264 | "\n",
265 | "\n"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "### 2.3 特征观察与处理\n",
273 | "我们对特征进行一下观察,可以把特征大概分为两大类: \n",
274 | "数值型特征:Survived ,Pclass, Age ,SibSp, Parch, Fare,其中Survived, Pclass为离散型数值特征,Age,SibSp, Parch, Fare为连续型数值特征 \n",
275 | "文本型特征:Name, Sex, Cabin,Embarked, Ticket,其中Sex, Cabin, Embarked, Ticket为类别型文本特征,数值型特征一般可以直接用于模型的训练,但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "#### 2.3.1 任务一:对年龄进行分箱(离散化)处理\n",
283 | "(1) 分箱操作是什么?\n",
284 | "\n",
285 | "(2) 将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示 \n",
286 | "\n",
287 | "(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示 \n",
288 | "\n",
289 | "(4) 将连续变量Age按10% 30% 50% 70% 90%五个年龄段,并用分类变量12345表示\n",
290 | "\n",
291 | "(5) 将上面的获得的数据分别进行保存,保存为csv格式"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": [
300 | "#分箱操作是什么:\n",
301 | "\n",
302 | "\n"
303 | ]
304 | },
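{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the three binnings with `pd.cut`/`pd.qcut` (assuming the data is in `df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['AgeBand'] = pd.cut(df['Age'], 5, labels=[1, 2, 3, 4, 5])  # 5 equal-width bins\n",
"df['AgeBand'] = pd.cut(df['Age'], [0, 5, 15, 30, 50, 80], labels=[1, 2, 3, 4, 5])  # given edges\n",
"df['AgeBand'] = pd.qcut(df['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])  # quantile edges"
]
},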
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "#写入代码\n",
312 | "\n",
313 | "\n"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "#写入代码\n",
323 | "\n",
324 | "\n"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "metadata": {},
331 | "outputs": [],
332 | "source": [
333 | "#写入代码\n",
334 | "\n",
335 | "\n"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "#### 2.3.2 任务二:对文本变量进行转换\n",
357 | "(1) 查看文本变量名及种类 \n",
358 | "(2) 将文本变量Sex, Cabin ,Embarked用数值变量12345表示 \n",
359 | "(3) 将文本变量Sex, Cabin, Embarked用one-hot编码表示"
360 | ]
361 | },
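{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (assuming the data is in `df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Sex'].value_counts()                                 # the categories and their counts\n",
"df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})  # label encoding\n",
"df = pd.get_dummies(df, columns=['Embarked'])            # one-hot encoding"
]
},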
362 | {
363 | "cell_type": "code",
364 | "execution_count": null,
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "#写入代码\n",
369 | "\n",
370 | "\n"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": null,
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "#写入代码\n",
380 | "\n",
381 | "\n"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {},
388 | "outputs": [],
389 | "source": [
390 | "#写入代码\n",
391 | "\n",
392 | "\n"
393 | ]
394 | },
395 | {
396 | "cell_type": "markdown",
397 | "metadata": {},
398 | "source": [
399 | "#### 2.3.3 任务三:从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)"
400 | ]
401 | },
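{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (assuming the data is in `df`); the title is the word followed by a dot inside Name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Title'] = df['Name'].str.extract('([A-Za-z]+)\\.', expand=False)\n",
"df['Title'].value_counts()"
]
},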
402 | {
403 | "cell_type": "code",
404 | "execution_count": null,
405 | "metadata": {},
406 | "outputs": [],
407 | "source": [
408 | "#写入代码\n",
409 | "\n",
410 | "\n"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": null,
416 | "metadata": {},
417 | "outputs": [],
418 | "source": [
419 | "#保存最终你完成的已经清理好的数据\n"
420 | ]
421 | }
422 | ],
423 | "metadata": {
424 | "kernelspec": {
425 | "display_name": "Python 3",
426 | "language": "python",
427 | "name": "python3"
428 | },
429 | "language_info": {
430 | "codemirror_mode": {
431 | "name": "ipython",
432 | "version": 3
433 | },
434 | "file_extension": ".py",
435 | "mimetype": "text/x-python",
436 | "name": "python",
437 | "nbconvert_exporter": "python",
438 | "pygments_lexer": "ipython3",
439 | "version": "3.6.12"
440 | },
441 | "toc": {
442 | "base_numbering": 1,
443 | "nav_menu": {},
444 | "number_sections": false,
445 | "sideBar": true,
446 | "skip_h1_title": false,
447 | "title_cell": "Table of Contents",
448 | "title_sidebar": "Contents",
449 | "toc_cell": false,
450 | "toc_position": {
451 | "height": "calc(100% - 180px)",
452 | "left": "10px",
453 | "top": "150px",
454 | "width": "433px"
455 | },
456 | "toc_section_display": true,
457 | "toc_window_display": true
458 | }
459 | },
460 | "nbformat": 4,
461 | "nbformat_minor": 4
462 | }
463 |
--------------------------------------------------------------------------------
/第二章项目集合/.ipynb_checkpoints/第二章:第三节数据重构2-课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了**数据的清洗**,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "#### 开始之前,导入numpy、pandas包和数据"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 3,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# 导入基本库\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# 载入上一个任务人保存的文件中:result.csv,并查看这个文件\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "# 2 第二章:数据重构\n"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## 第一部分:数据聚合与运算"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "### 2.6 数据运用"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "#### 2.6.1 任务一:通过教材《Python for Data Analysis》P303、Google or anything来学习了解GroupBy机制"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 5,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "#写入心得\n"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "#### 2.4.2:任务二:计算泰坦尼克号男性与女性的平均票价"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 2,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# 写入代码\n"
86 | ]
87 | },
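{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (assuming result.csv was loaded into `df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('Sex')['Fare'].mean()  # average fare per sex"
]
},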
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "在了解GroupBy机制之后,运用这个机制完成一系列的操作,来达到我们的目的。\n",
93 | "\n",
94 | "下面通过几个任务来熟悉GroupBy机制。"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "#### 2.4.3:任务三:统计泰坦尼克号中男女的存活人数"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 2,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "# 写入代码\n"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "#### 2.4.4:任务四:计算客舱不同等级的存活人数"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 2,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "# 写入代码\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "【**提示:**】表中的存活那一栏,可以发现如果还活着记为1,死亡记为0"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "【**思考**】从数据分析的角度,上面的统计结果可以得出那些结论"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 9,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "#思考心得 \n",
150 | "\n"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "【思考】从任务二到任务三中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 1,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "#思考心得\n",
167 | "\n",
168 | "\n"
169 | ]
170 | },
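{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the agg() + rename version (assuming the data is in `df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('Sex').agg({'Fare': 'mean', 'Survived': 'sum'}).rename(\n",
"    columns={'Fare': 'mean_fare', 'Survived': 'sum_survived'})"
]
},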
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "#### 2.4.5:任务五:统计在不同等级的票中的不同年龄的船票花费的平均值"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 2,
181 | "metadata": {},
182 | "outputs": [],
183 | "source": [
184 | "# 写入代码\n"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "#### 2.4.6:任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 2,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "# 写入代码\n"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "#### 2.4.7:任务七:得出不同年龄的总的存活人数,然后找出存活人数的最高的年龄,最后计算存活人数最高的存活率(存活人数/总人数)\n"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 2,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "# 写入代码\n"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": 2,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "# 写入代码\n"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 2,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "# 写入代码\n"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 2,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "# 写入代码\n"
244 | ]
245 | }
246 | ],
247 | "metadata": {
248 | "kernelspec": {
249 | "display_name": "Python 3",
250 | "language": "python",
251 | "name": "python3"
252 | },
253 | "language_info": {
254 | "codemirror_mode": {
255 | "name": "ipython",
256 | "version": 3
257 | },
258 | "file_extension": ".py",
259 | "mimetype": "text/x-python",
260 | "name": "python",
261 | "nbconvert_exporter": "python",
262 | "pygments_lexer": "ipython3",
263 | "version": "3.6.12"
264 | },
265 | "toc": {
266 | "base_numbering": 1,
267 | "nav_menu": {},
268 | "number_sections": false,
269 | "sideBar": true,
270 | "skip_h1_title": false,
271 | "title_cell": "Table of Contents",
272 | "title_sidebar": "Contents",
273 | "toc_cell": false,
274 | "toc_position": {
275 | "height": "calc(100% - 180px)",
276 | "left": "10px",
277 | "top": "150px",
278 | "width": "582px"
279 | },
280 | "toc_section_display": true,
281 | "toc_window_display": true
282 | }
283 | },
284 | "nbformat": 4,
285 | "nbformat_minor": 4
286 | }
287 |
--------------------------------------------------------------------------------
/第二章项目集合/.ipynb_checkpoints/第二章:第二节数据重构1-课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了**数据的清洗**,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "#### 开始之前,导入numpy、pandas包和数据"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 2,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# 导入基本库\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# 载入data文件中的:train-left-up.csv\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "# 2 第二章:数据重构\n"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "### 2.4 数据的合并"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "#### 2.4.1 任务一:将data文件夹里面的所有数据都载入,观察数据的之间的关系"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "#写入代码\n",
63 | "\n"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "#写入代码\n",
73 | "\n"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "【提示】结合之前我们加载的train.csv数据,大致预测一下上面的数据是什么"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "#### 2.4.2:任务二:使用concat方法:将数据train-left-up.csv和train-right-up.csv横向合并为一张表,并保存这张表为result_up"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "#写入代码\n",
97 | "\n"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "#### 2.4.3 任务三:使用concat方法:将train-left-down和train-right-down横向合并为一张表,并保存这张表为result_down。然后将上边的result_up和result_down纵向合并为result。"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "#写入代码\n",
114 | "\n"
115 | ]
116 | },
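{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (assuming the four pieces were loaded as `left_up`, `right_up`, `left_down`, `right_down`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result_up = pd.concat([left_up, right_up], axis=1)    # horizontal\n",
"result_down = pd.concat([left_down, right_down], axis=1)\n",
"result = pd.concat([result_up, result_down], axis=0)  # vertical"
]
},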
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "#### 2.4.4 任务四:使用DataFrame自带的方法join方法和append:完成任务二和任务三的任务"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "#写入代码\n",
131 | "\n"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "#### 2.4.5 任务五:使用Panads的merge方法和DataFrame的append方法:完成任务二和任务三的任务"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "#写入代码\n",
148 | "\n"
149 | ]
150 | },
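{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the same merges with join/merge plus append (same hypothetical variable names as above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result_up = left_up.join(right_up)       # join aligns on the index\n",
"result = result_up.append(result_down)   # append stacks vertically\n",
"# the merge version of the horizontal step:\n",
"result_up = pd.merge(left_up, right_up, left_index=True, right_index=True)"
]
},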
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "【思考】对比merge、join以及concat的方法的不同以及相同。思考一下在任务四和任务五的情况下,为什么都要求使用DataFrame的append方法,如何只要求使用merge或者join可不可以完成任务四和任务五呢?"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "#### 2.4.6 任务六:完成的数据保存为result.csv"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "#写入代码\n",
172 | "\n"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "### 2.5 换一种角度看数据"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {},
185 | "source": [
186 | "#### 2.5.1 任务一:将我们的数据变为Series类型的数据"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "#写入代码\n",
196 | "\n"
197 | ]
198 | },
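{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (assuming the merged table is in `result`): `stack()` pivots the columns into the row index, which turns the DataFrame into a Series:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"unit_result = result.stack()\n",
"type(unit_result)"
]
},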
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "#写入代码\n",
206 | "\n"
207 | ]
208 | }
209 | ],
210 | "metadata": {
211 | "kernelspec": {
212 | "display_name": "Python 3",
213 | "language": "python",
214 | "name": "python3"
215 | },
216 | "language_info": {
217 | "codemirror_mode": {
218 | "name": "ipython",
219 | "version": 3
220 | },
221 | "file_extension": ".py",
222 | "mimetype": "text/x-python",
223 | "name": "python",
224 | "nbconvert_exporter": "python",
225 | "pygments_lexer": "ipython3",
226 | "version": "3.7.3"
227 | },
228 | "toc": {
229 | "base_numbering": 1,
230 | "nav_menu": {},
231 | "number_sections": false,
232 | "sideBar": true,
233 | "skip_h1_title": false,
234 | "title_cell": "Table of Contents",
235 | "title_sidebar": "Contents",
236 | "toc_cell": false,
237 | "toc_position": {
238 | "height": "calc(100% - 180px)",
239 | "left": "10px",
240 | "top": "150px",
241 | "width": "582px"
242 | },
243 | "toc_section_display": true,
244 | "toc_window_display": true
245 | }
246 | },
247 | "nbformat": 4,
248 | "nbformat_minor": 2
249 | }
250 |
--------------------------------------------------------------------------------
/第二章项目集合/.ipynb_checkpoints/第二章:第四节数据可视化-课程-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**回顾学习完第一章,我们对泰坦尼克号数据有了基本的了解,也学到了一些基本的统计方法,第二章中我们学习了数据的清理和重构,使得数据更加的易于理解;今天我们要学习的是第二章第三节:**数据可视化**,主要给大家介绍一下Python数据可视化库Matplotlib,在本章学习中,你也许会觉得数据很有趣。在打比赛的过程中,数据可视化可以让我们更好的看到每一个关键步骤的结果如何,可以用来优化方案,是一个很有用的技巧。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# 2 第二章:数据可视化"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "#### 开始之前,导入numpy、pandas以及matplotlib包和数据"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "# 加载所需的库\n",
31 | "# 如果出现 ModuleNotFoundError: No module named 'xxxx'\n",
32 | "# 你只需要在终端/cmd下 pip install xxxx 即可\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "#加载result.csv这个数据\n"
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "### 2.7 如何让人一眼看懂你的数据?\n",
49 | "《Python for Data Analysis》第九章"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "#### 2.7.1 任务一:跟着书本第九章,了解matplotlib,自己创建一个数据项,对其进行基本可视化"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "【思考】最基本的可视化图案有哪些?分别适用于那些场景?(比如折线图适合可视化某个属性值随时间变化的走势)"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "#思考回答\n",
73 | "#这一部分需要了解可视化图案的的逻辑,知道什么样的图案可以表达什么样的信号b\n"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "#### 2.7.2 任务二:可视化展示泰坦尼克号数据集中男女中生存人数分布情况(用柱状图试试)。"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 4,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "#代码编写\n",
90 | "\n"
91 | ]
92 | },
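{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (assuming result.csv was loaded into `df` and matplotlib.pyplot is imported as `plt`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sex_survived = df.groupby('Sex')['Survived'].sum()\n",
"sex_survived.plot.bar()\n",
"plt.title('survived count by sex')\n",
"plt.show()"
]
},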
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "【思考】计算出泰坦尼克号数据集中男女中死亡人数,并可视化展示?如何和男女生存人数可视化柱状图结合到一起?看到你的数据可视化,说说你的第一感受(比如:你一眼看出男生存活人数更多,那么性别可能会影响存活率)。"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "#思考题回答\n",
107 | "\n"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "#### 2.7.3 任务三:可视化展示泰坦尼克号数据集中男女中生存人与死亡人数的比例图(用柱状图试试)。"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 7,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "#代码编写\n",
124 | "# 提示:计算男女中死亡人数 1表示生存,0表示死亡\n",
125 | "\n",
126 | "\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "【提示】男女这两个数据轴,存活和死亡人数按比例用柱状图表示"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "#### 2.7.4 任务四:可视化展示泰坦尼克号数据集中不同票价的人生存和死亡人数分布情况。(用折线图试试)(横轴是不同票价,纵轴是存活人数)"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "【提示】对于这种统计性质的且用折线表示的数据,你可以考虑将数据排序或者不排序来分别表示。看看你能发现什么?"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 4,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "#代码编写\n",
157 | "# 计算不同票价中生存与死亡人数 1表示生存,0表示死亡\n",
158 | "\n",
159 | "\n"
160 | ]
161 | },
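{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (same assumptions as above): survival counts per fare, sorted, drawn as a line:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fare_sur = df.groupby('Fare')['Survived'].value_counts()\n",
"fare_sur.sort_values(ascending=False).plot(grid=True)\n",
"plt.show()"
]
},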
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {
165 | "scrolled": true
166 | },
167 | "source": [
168 | "#### 2.7.5 任务五:可视化展示泰坦尼克号数据集中不同仓位等级的人生存和死亡人员的分布情况。(用柱状图试试)"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 8,
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "#代码编写\n",
178 | "# 1表示生存,0表示死亡\n",
179 | "\n",
180 | "\n"
181 | ]
182 | },
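{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (same assumptions as above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pclass_sur = df.groupby(['Pclass', 'Survived']).size().unstack()\n",
"pclass_sur.plot.bar()  # one group of bars per cabin class, split by survival\n",
"plt.show()"
]
},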
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "【思考】看到这个前面几个数据可视化,说说你的第一感受和你的总结"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": null,
193 | "metadata": {},
194 | "outputs": [],
195 | "source": [
196 | "#思考题回答\n",
197 | "\n"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "#### 2.7.6 任务六:可视化展示泰坦尼克号数据集中不同年龄的人生存与死亡人数分布情况。(不限表达方式)"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 4,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "#代码编写\n",
214 | "\n"
215 | ]
216 | },
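{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch (same assumptions as above): overlaid age histograms for survivors and deaths:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Age'][df['Survived'] == 1].plot.hist(alpha=0.5, label='survived')\n",
"df['Age'][df['Survived'] == 0].plot.hist(alpha=0.5, label='died')\n",
"plt.legend()\n",
"plt.show()"
]
},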
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "#### 2.7.7 任务七:可视化展示泰坦尼克号数据集中不同仓位等级的人年龄分布情况。(用折线图试试)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 4,
227 | "metadata": {},
228 | "outputs": [],
229 | "source": [
230 | "#代码编写\n",
231 | "\n"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "【思考】上面所有可视化的例子做一个总体的分析,你看看你能不能有自己发现"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": null,
244 | "metadata": {},
245 | "outputs": [],
246 | "source": [
247 | "#思考题回答\n",
248 | "\n"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "【总结】到这里,我们的可视化就告一段落啦,如果你对数据可视化极其感兴趣,你还可以了解一下其他可视化模块,如:pyecharts,bokeh等。\n",
256 | "\n",
257 | "如果你在工作中使用数据可视化,你必须知道数据可视化最大的作用不是炫酷,而是最快最直观的理解数据要表达什么,你觉得呢?"
258 | ]
259 | }
260 | ],
261 | "metadata": {
262 | "kernelspec": {
263 | "display_name": "Python 3",
264 | "language": "python",
265 | "name": "python3"
266 | },
267 | "language_info": {
268 | "codemirror_mode": {
269 | "name": "ipython",
270 | "version": 3
271 | },
272 | "file_extension": ".py",
273 | "mimetype": "text/x-python",
274 | "name": "python",
275 | "nbconvert_exporter": "python",
276 | "pygments_lexer": "ipython3",
277 | "version": "3.7.3"
278 | }
279 | },
280 | "nbformat": 4,
281 | "nbformat_minor": 2
282 | }
283 |
--------------------------------------------------------------------------------
/第二章项目集合/data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datawhalechina/hands-on-data-analysis/a22033b8feadbf8e1f937ac5083c3f7ba2fba5db/第二章项目集合/data/.DS_Store
--------------------------------------------------------------------------------
/第二章项目集合/data/train-right-down.csv:
--------------------------------------------------------------------------------
1 | Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2 | male,31,0,0,C.A. 18723,10.5,,S
3 | female,45,1,1,F.C.C. 13529,26.25,,S
4 | male,20,0,0,345769,9.5,,S
5 | male,25,1,0,347076,7.775,,S
6 | female,28,0,0,230434,13,,S
7 | male,,0,0,65306,8.1125,,S
8 | male,4,0,2,33638,81.8583,A34,S
9 | female,13,0,1,250644,19.5,,S
10 | male,34,0,0,113794,26.55,,S
11 | female,5,2,1,2666,19.2583,,C
12 | male,52,0,0,113786,30.5,C104,S
13 | male,36,1,2,C.A. 34651,27.75,,S
14 | male,,1,0,65303,19.9667,,S
15 | male,30,0,0,113051,27.75,C111,C
16 | male,49,1,0,17453,89.1042,C92,C
17 | male,,0,0,A/5 2817,8.05,,S
18 | male,29,0,0,349240,7.8958,,C
19 | male,65,0,0,13509,26.55,E38,S
20 | female,,1,0,17464,51.8625,D21,S
21 | female,50,0,0,F.C.C. 13531,10.5,,S
22 | male,,0,0,371060,7.75,,Q
23 | male,48,0,0,19952,26.55,E12,S
24 | male,34,0,0,364506,8.05,,S
25 | male,47,0,0,111320,38.5,E63,S
26 | male,48,0,0,234360,13,,S
27 | male,,0,0,A/S 2816,8.05,,S
28 | male,38,0,0,SOTON/O.Q. 3101306,7.05,,S
29 | male,,0,0,239853,0,,S
30 | male,56,0,0,113792,26.55,,S
31 | male,,0,0,36209,7.725,,Q
32 | female,0.75,2,1,2666,19.2583,,C
33 | male,,0,0,323592,7.25,,S
34 | male,38,0,0,315089,8.6625,,S
35 | female,33,1,2,C.A. 34651,27.75,,S
36 | female,23,0,0,SC/AH Basle 541,13.7917,D,C
37 | female,22,0,0,7553,9.8375,,S
38 | male,,0,0,110465,52,A14,S
39 | male,34,1,0,31027,21,,S
40 | male,29,1,0,3460,7.0458,,S
41 | male,22,0,0,350060,7.5208,,S
42 | female,2,0,1,3101298,12.2875,,S
43 | male,9,5,2,CA 2144,46.9,,S
44 | male,,0,0,239854,0,,S
45 | male,50,0,0,A/5 3594,8.05,,S
46 | female,63,0,0,4134,9.5875,,S
47 | male,25,1,0,11967,91.0792,B49,C
48 | female,,3,1,4133,25.4667,,S
49 | female,35,1,0,19943,90,C93,S
50 | male,58,0,0,11771,29.7,B37,C
51 | male,30,0,0,A.5. 18509,8.05,,S
52 | male,9,1,1,C.A. 37671,15.9,,S
53 | male,,1,0,65304,19.9667,,S
54 | male,21,0,0,SOTON/OQ 3101317,7.25,,S
55 | male,55,0,0,113787,30.5,C30,S
56 | male,71,0,0,PC 17609,49.5042,,C
57 | male,21,0,0,A/4 45380,8.05,,S
58 | male,,0,0,2627,14.4583,,C
59 | female,54,1,0,36947,78.2667,D20,C
60 | male,,0,0,C.A. 6212,15.1,,S
61 | female,25,1,2,113781,151.55,C22 C26,S
62 | male,24,0,0,350035,7.7958,,S
63 | male,17,0,0,315086,8.6625,,S
64 | female,21,0,0,364846,7.75,,Q
65 | female,,0,0,330909,7.6292,,Q
66 | female,37,0,0,4135,9.5875,,S
67 | female,16,0,0,110152,86.5,B79,S
68 | male,18,1,0,PC 17758,108.9,C65,C
69 | female,33,0,2,26360,26,,S
70 | male,,0,0,111427,26.55,,S
71 | male,28,0,0,C 4001,22.525,,S
72 | male,26,0,0,1601,56.4958,,S
73 | male,29,0,0,382651,7.75,,Q
74 | male,,0,0,SOTON/OQ 3101316,8.05,,S
75 | male,36,0,0,PC 17473,26.2875,E25,S
76 | female,54,1,0,PC 17603,59.4,,C
77 | male,24,0,0,349209,7.4958,,S
78 | male,47,0,0,36967,34.0208,D46,S
79 | female,34,0,0,C.A. 34260,10.5,F33,S
80 | male,,0,0,371110,24.15,,Q
81 | female,36,1,0,226875,26,,S
82 | male,32,0,0,349242,7.8958,,S
83 | female,30,0,0,12749,93.5,B73,S
84 | male,22,0,0,349252,7.8958,,S
85 | male,,0,0,2624,7.225,,C
86 | female,44,0,1,111361,57.9792,B18,C
87 | male,,0,0,2700,7.2292,,C
88 | male,40.5,0,0,367232,7.75,,Q
89 | female,50,0,0,W./C. 14258,10.5,,S
90 | male,,0,0,PC 17483,221.7792,C95,S
91 | male,39,0,0,3101296,7.925,,S
92 | male,23,2,1,29104,11.5,,S
93 | female,2,1,1,26360,26,,S
94 | male,,0,0,2641,7.2292,,C
95 | male,17,1,1,2690,7.2292,,C
96 | female,,0,2,2668,22.3583,,C
97 | female,30,0,0,315084,8.6625,,S
98 | female,7,0,2,F.C.C. 13529,26.25,,S
99 | male,45,0,0,113050,26.55,B38,S
100 | female,30,0,0,PC 17761,106.425,,C
101 | male,,0,0,364498,14.5,,S
102 | female,22,0,2,13568,49.5,B39,C
103 | female,36,0,2,WE/P 5735,71,B22,S
104 | female,9,4,2,347082,31.275,,S
105 | female,11,4,2,347082,31.275,,S
106 | male,32,1,0,2908,26,,S
107 | male,50,1,0,PC 17761,106.425,C86,C
108 | male,64,0,0,693,26,,S
109 | female,19,1,0,2908,26,,S
110 | male,,0,0,SC/PARIS 2146,13.8625,,C
111 | male,33,1,1,363291,20.525,,S
112 | male,8,1,1,C.A. 33112,36.75,,S
113 | male,17,0,2,17421,110.8833,C70,C
114 | male,27,0,0,244358,26,,S
115 | male,,0,0,330979,7.8292,,Q
116 | male,22,0,0,2620,7.225,,C
117 | female,22,0,0,347085,7.775,,S
118 | male,62,0,0,113807,26.55,,S
119 | female,48,1,0,11755,39.6,A16,C
120 | male,,0,0,PC 17757,227.525,,C
121 | female,39,1,1,110413,79.65,E67,S
122 | female,36,1,0,345572,17.4,,S
123 | male,,0,0,372622,7.75,,Q
124 | male,40,0,0,349251,7.8958,,S
125 | male,28,0,0,218629,13.5,,S
126 | male,,0,0,SOTON/OQ 392082,8.05,,S
127 | female,,0,0,SOTON/O.Q. 392087,8.05,,S
128 | male,24,2,0,A/4 48871,24.15,,S
129 | male,19,0,0,349205,7.8958,,S
130 | female,29,0,4,349909,21.075,,S
131 | male,,0,0,2686,7.2292,,C
132 | male,32,0,0,350417,7.8542,,S
133 | male,62,0,0,S.W./PP 752,10.5,,S
134 | female,53,2,0,11769,51.4792,C101,S
135 | male,36,0,0,PC 17474,26.3875,E25,S
136 | female,,0,0,14312,7.75,,Q
137 | male,16,0,0,A/4. 20589,8.05,,S
138 | male,19,0,0,358585,14.5,,S
139 | female,34,0,0,243880,13,,S
140 | female,39,1,0,13507,55.9,E44,S
141 | female,,1,0,2689,14.4583,,C
142 | male,32,0,0,STON/O 2. 3101286,7.925,,S
143 | female,25,1,1,237789,30,,S
144 | female,39,1,1,17421,110.8833,C68,C
145 | male,54,0,0,28403,26,,S
146 | male,36,0,0,13049,40.125,A10,C
147 | male,,0,0,3411,8.7125,,C
148 | female,18,0,2,110413,79.65,E68,S
149 | male,47,0,0,237565,15,,S
150 | male,60,1,1,13567,79.2,B41,C
151 | male,22,0,0,14973,8.05,,S
152 | male,,0,0,A./5. 3235,8.05,,S
153 | male,35,0,0,STON/O 2. 3101273,7.125,,S
154 | female,52,1,0,36947,78.2667,D20,C
155 | male,47,0,0,A/5 3902,7.25,,S
156 | female,,0,2,364848,7.75,,Q
157 | male,37,1,0,SC/AH 29037,26,,S
158 | male,36,1,1,345773,24.15,,S
159 | female,,0,0,248727,33,,S
160 | male,49,0,0,LINE,0,,S
161 | male,,0,0,2664,7.225,,C
162 | male,49,1,0,PC 17485,56.9292,A20,C
163 | female,24,2,1,243847,27,,S
164 | male,,0,0,349214,7.8958,,S
165 | male,,0,0,113796,42.4,,S
166 | male,44,0,0,364511,8.05,,S
167 | male,35,0,0,111426,26.55,,C
168 | male,36,1,0,349910,15.55,,S
169 | male,30,0,0,349246,7.8958,,S
170 | male,27,0,0,113804,30.5,,S
171 | female,22,1,2,SC/Paris 2123,41.5792,,C
172 | female,40,0,0,PC 17582,153.4625,C125,S
173 | female,39,1,5,347082,31.275,,S
174 | male,,0,0,SOTON/O.Q. 3101305,7.05,,S
175 | female,,1,0,367230,15.5,,Q
176 | male,,0,0,370377,7.75,,Q
177 | male,35,0,0,364512,8.05,,S
178 | female,24,1,2,220845,65,,S
179 | male,34,1,1,347080,14.4,,S
180 | female,26,1,0,A/5. 3336,16.1,,S
181 | female,4,2,1,230136,39,F4,S
182 | male,26,0,0,31028,10.5,,S
183 | male,27,1,0,2659,14.4542,,C
184 | male,42,1,0,11753,52.5542,D19,S
185 | male,20,1,1,2653,15.7417,,C
186 | male,21,0,0,350029,7.8542,,S
187 | male,21,0,0,54636,16.1,,S
188 | male,61,0,0,36963,32.3208,D50,S
189 | male,57,0,0,219533,12.35,,Q
190 | female,21,0,0,13502,77.9583,D9,S
191 | male,26,0,0,349224,7.8958,,S
192 | male,,0,0,334912,7.7333,,Q
193 | male,80,0,0,27042,30,A23,S
194 | male,51,0,0,347743,7.0542,,S
195 | male,32,0,0,13214,30.5,B50,C
196 | male,,0,0,112052,0,,S
197 | female,9,3,2,347088,27.9,,S
198 | female,28,0,0,237668,13,,S
199 | male,32,0,0,STON/O 2. 3101292,7.925,,S
200 | male,31,1,1,C.A. 31921,26.25,,S
201 | female,41,0,5,3101295,39.6875,,S
202 | male,,1,0,376564,16.1,,S
203 | male,20,0,0,350050,7.8542,,S
204 | female,24,0,0,PC 17477,69.3,B35,C
205 | female,2,3,2,347088,27.9,,S
206 | male,,0,0,1601,56.4958,,S
207 | female,0.75,2,1,2666,19.2583,,C
208 | male,48,1,0,PC 17572,76.7292,D33,C
209 | male,19,0,0,349231,7.8958,,S
210 | male,56,0,0,13213,35.5,A26,C
211 | male,,0,0,S.O./P.P. 751,7.55,,S
212 | female,23,0,0,CA. 2314,7.55,,S
213 | male,,0,0,349221,7.8958,,S
214 | female,18,0,1,231919,23,,S
215 | male,21,0,0,8475,8.4333,,S
216 | female,,0,0,330919,7.8292,,Q
217 | female,18,0,0,365226,6.75,,Q
218 | male,24,2,0,S.O.C. 14879,73.5,,S
219 | male,,0,0,349223,7.8958,,S
220 | female,32,1,1,364849,15.5,,Q
221 | male,23,0,0,29751,13,,S
222 | male,58,0,2,35273,113.275,D48,C
223 | male,50,2,0,PC 17611,133.65,,S
224 | male,40,0,0,2623,7.225,,C
225 | male,47,0,0,5727,25.5875,E58,S
226 | male,36,0,0,349210,7.4958,,S
227 | male,20,1,0,STON/O 2. 3101285,7.925,,S
228 | male,32,2,0,S.O.C. 14879,73.5,,S
229 | male,25,0,0,234686,13,,S
230 | male,,0,0,312993,7.775,,S
231 | male,43,0,0,A/5 3536,8.05,,S
232 | female,,1,0,19996,52,C126,S
233 | female,40,1,1,29750,39,,S
234 | male,31,1,0,F.C. 12750,52,B71,S
235 | male,70,0,0,C.A. 24580,10.5,,S
236 | male,31,0,0,244270,13,,S
237 | male,,0,0,239856,0,,S
238 | male,18,0,0,349912,7.775,,S
239 | male,24.5,0,0,342826,8.05,,S
240 | female,18,0,0,4138,9.8417,,S
241 | female,43,1,6,CA 2144,46.9,,S
242 | male,36,0,1,PC 17755,512.3292,B51 B53 B55,C
243 | female,,0,0,330935,8.1375,,Q
244 | male,27,0,0,PC 17572,76.7292,D49,C
245 | male,20,0,0,6563,9.225,,S
246 | male,14,5,2,CA 2144,46.9,,S
247 | male,60,1,1,29750,39,,S
248 | male,25,1,2,SC/Paris 2123,41.5792,,C
249 | male,14,4,1,3101295,39.6875,,S
250 | male,19,0,0,349228,10.1708,,S
251 | male,18,0,0,350036,7.7958,,S
252 | female,15,0,1,24160,211.3375,B5,S
253 | male,31,1,0,17474,57,B20,S
254 | female,4,0,1,349256,13.4167,,C
255 | male,,0,0,1601,56.4958,,S
256 | male,25,0,0,2672,7.225,,C
257 | male,60,0,0,113800,26.55,,S
258 | male,52,0,0,248731,13.5,,S
259 | male,44,0,0,363592,8.05,,S
260 | female,,0,0,35852,7.7333,,Q
261 | male,49,1,1,17421,110.8833,C68,C
262 | male,42,0,0,348121,7.65,F G63,S
263 | female,18,1,0,PC 17757,227.525,C62 C64,C
264 | male,35,0,0,PC 17475,26.2875,E24,S
265 | female,18,0,1,2691,14.4542,,C
266 | male,25,0,0,36864,7.7417,,Q
267 | male,26,1,0,350025,7.8542,,S
268 | male,39,0,0,250655,26,,S
269 | female,45,0,0,223596,13.5,,S
270 | male,42,0,0,PC 17476,26.2875,E24,S
271 | female,22,0,0,113781,151.55,,S
272 | male,,1,1,2661,15.2458,,C
273 | female,24,0,0,PC 17482,49.5042,C90,C
274 | male,,0,0,113028,26.55,C124,S
275 | male,48,1,0,19996,52,C126,S
276 | male,29,0,0,7545,9.4833,,S
277 | male,52,0,0,250647,13,,S
278 | male,19,0,0,348124,7.65,F G73,S
279 | female,38,0,0,PC 17757,227.525,C45,C
280 | female,27,0,0,34218,10.5,E101,S
281 | male,,0,0,36568,15.5,,Q
282 | male,33,0,0,347062,7.775,,S
283 | female,6,0,1,248727,33,,S
284 | male,17,1,0,350048,7.0542,,S
285 | male,34,0,0,12233,13,,S
286 | male,50,0,0,250643,13,,S
287 | male,27,1,0,113806,53.1,E8,S
288 | male,20,0,0,315094,8.6625,,S
289 | female,30,3,0,31027,21,,S
290 | female,,0,0,36866,7.7375,,Q
291 | male,25,1,0,236853,26,,S
292 | female,25,1,0,STON/O2. 3101271,7.925,,S
293 | female,29,0,0,24160,211.3375,B5,S
294 | male,11,0,0,2699,18.7875,,C
295 | male,,0,0,239855,0,,S
296 | male,23,0,0,28425,13,,S
297 | male,23,0,0,233639,13,,S
298 | male,28.5,0,0,54636,16.1,,S
299 | female,48,1,3,W./C. 6608,34.375,,S
300 | male,35,0,0,PC 17755,512.3292,B101,C
301 | male,,0,0,349201,7.8958,,S
302 | male,,0,0,349218,7.8958,,S
303 | male,,0,0,16988,30,D45,S
304 | male,36,1,0,19877,78.85,C46,S
305 | female,21,2,2,PC 17608,262.375,B57 B59 B63 B66,C
306 | male,24,1,0,376566,16.1,,S
307 | male,31,0,0,STON/O 2. 3101288,7.925,,S
308 | male,70,1,1,WE/P 5735,71,B22,S
309 | male,16,1,1,C.A. 2673,20.25,,S
310 | female,30,0,0,250648,13,,S
311 | male,19,1,0,113773,53.1,D30,S
312 | male,31,0,0,335097,7.75,,Q
313 | female,4,1,1,29103,23,,S
314 | male,6,0,1,392096,12.475,E121,S
315 | male,33,0,0,345780,9.5,,S
316 | male,23,0,0,349204,7.8958,,S
317 | female,48,1,2,220845,65,,S
318 | male,0.67,1,1,250649,14.5,,S
319 | male,28,0,0,350042,7.7958,,S
320 | male,18,0,0,29108,11.5,,S
321 | male,34,0,0,363294,8.05,,S
322 | female,33,0,0,110152,86.5,B77,S
323 | male,,0,0,358585,14.5,,S
324 | male,41,0,0,SOTON/O2 3101272,7.125,,S
325 | male,20,0,0,2663,7.2292,,C
326 | female,36,1,2,113760,120,B96 B98,S
327 | male,16,0,0,347074,7.775,,S
328 | female,51,1,0,13502,77.9583,D11,S
329 | male,,0,0,112379,39.6,,C
330 | female,30.5,0,0,364850,7.75,,Q
331 | male,,1,0,371110,24.15,,Q
332 | male,32,0,0,8471,8.3625,,S
333 | male,24,0,0,345781,9.5,,S
334 | male,48,0,0,350047,7.8542,,S
335 | female,57,0,0,S.O./P.P. 3,10.5,E77,S
336 | male,,0,0,2674,7.225,,C
337 | female,54,1,3,29105,23,,S
338 | male,18,0,0,347078,7.75,,S
339 | male,,0,0,383121,7.75,F38,Q
340 | female,5,0,0,364516,12.475,,S
341 | male,,0,0,36865,7.7375,,Q
342 | female,43,0,1,24160,211.3375,B3,S
343 | female,13,0,0,2687,7.2292,,C
344 | female,17,1,0,17474,57,B20,S
345 | male,29,0,0,113501,30,D6,S
346 | male,,1,2,W./C. 6607,23.45,,S
347 | male,25,0,0,SOTON/O.Q. 3101312,7.05,,S
348 | male,25,0,0,374887,7.25,,S
349 | female,18,0,0,3101265,7.4958,,S
350 | male,8,4,1,382652,29.125,,Q
351 | male,1,1,2,C.A. 2315,20.575,,S
352 | male,46,0,0,PC 17593,79.2,B82 B84,C
353 | male,,0,0,12460,7.75,,Q
354 | male,16,0,0,239865,26,,S
355 | female,,8,2,CA. 2343,69.55,,S
356 | male,,0,0,PC 17600,30.6958,,C
357 | male,25,0,0,349203,7.8958,,S
358 | male,39,0,0,28213,13,,S
359 | female,49,0,0,17465,25.9292,D17,S
360 | female,31,0,0,349244,8.6833,,S
361 | male,30,0,0,2685,7.2292,,C
362 | female,30,1,1,345773,24.15,,S
363 | male,34,0,0,250647,13,,S
364 | female,31,1,1,C.A. 31921,26.25,,S
365 | male,11,1,2,113760,120,B96 B98,S
366 | male,0.42,0,1,2625,8.5167,,C
367 | male,27,0,0,347089,6.975,,S
368 | male,31,0,0,347063,7.775,,S
369 | male,39,0,0,112050,0,A36,S
370 | female,18,0,0,347087,7.775,,S
371 | male,39,0,0,248723,13,,S
372 | female,33,1,0,113806,53.1,E8,S
373 | male,26,0,0,3474,7.8875,,S
374 | male,39,0,0,A/4 48871,24.15,,S
375 | male,35,0,0,28206,10.5,,S
376 | female,6,4,2,347082,31.275,,S
377 | male,30.5,0,0,364499,8.05,,S
378 | male,,0,0,112058,0,B102,S
379 | female,23,0,0,STON/O2. 3101290,7.925,,S
380 | male,31,1,1,S.C./PARIS 2079,37.0042,,C
381 | male,43,0,0,C 7075,6.45,,S
382 | male,10,3,2,347088,27.9,,S
383 | female,52,1,1,12749,93.5,B69,S
384 | male,27,0,0,315098,8.6625,,S
385 | male,38,0,0,19972,0,,S
386 | female,27,0,1,392096,12.475,E121,S
387 | male,2,4,1,3101295,39.6875,,S
388 | male,,0,0,368323,6.95,,Q
389 | male,,0,0,1601,56.4958,,S
390 | male,1,0,2,S.C./PARIS 2079,37.0042,,C
391 | male,,0,0,367228,7.75,,Q
392 | female,62,0,0,113572,80,B28,
393 | female,15,1,0,2659,14.4542,,C
394 | male,0.83,1,1,29106,18.75,,S
395 | male,,0,0,2671,7.2292,,C
396 | male,23,0,0,347468,7.8542,,S
397 | male,18,0,0,2223,8.3,,S
398 | female,39,1,1,PC 17756,83.1583,E49,C
399 | male,21,0,0,315097,8.6625,,S
400 | male,,0,0,392092,8.05,,S
401 | male,32,0,0,1601,56.4958,,S
402 | male,,0,0,11774,29.7,C47,C
403 | male,20,0,0,SOTON/O2 3101287,7.925,,S
404 | male,16,0,0,S.O./P.P. 3,10.5,,S
405 | female,30,0,0,113798,31,,C
406 | male,34.5,0,0,2683,6.4375,,C
407 | male,17,0,0,315090,8.6625,,S
408 | male,42,0,0,C.A. 5547,7.55,,S
409 | male,,8,2,CA. 2343,69.55,,S
410 | male,35,0,0,349213,7.8958,,C
411 | male,28,0,1,248727,33,,S
412 | female,,1,0,17453,89.1042,C92,C
413 | male,4,4,2,347082,31.275,,S
414 | male,74,0,0,347060,7.775,,S
415 | female,9,1,1,2678,15.2458,,C
416 | female,16,0,1,PC 17592,39.4,D28,S
417 | female,44,1,0,244252,26,,S
418 | female,18,0,1,392091,9.35,,S
419 | female,45,1,1,36928,164.8667,,S
420 | male,51,0,0,113055,26.55,E17,S
421 | female,24,0,3,2666,19.2583,,C
422 | male,,0,0,2629,7.2292,,C
423 | male,41,2,0,350026,14.1083,,S
424 | male,21,1,0,28134,11.5,,S
425 | female,48,0,0,17466,25.9292,D17,S
426 | female,,8,2,CA. 2343,69.55,,S
427 | male,24,0,0,233866,13,,S
428 | female,42,0,0,236852,13,,S
429 | female,27,1,0,SC/PARIS 2149,13.8583,,C
430 | male,31,0,0,PC 17590,50.4958,A24,S
431 | male,,0,0,345777,9.5,,S
432 | male,4,1,1,347742,11.1333,,S
433 | male,26,0,0,349248,7.8958,,S
434 | female,47,1,1,11751,52.5542,D35,S
435 | male,33,0,0,695,5,B51 B53 B55,S
436 | male,47,0,0,345765,9,,S
437 | female,28,1,0,P/PP 3381,24,,C
438 | female,15,0,0,2667,7.225,,C
439 | male,20,0,0,7534,9.8458,,S
440 | male,19,0,0,349212,7.8958,,S
441 | male,,0,0,349217,7.8958,,S
442 | female,56,0,1,11767,83.1583,C50,C
443 | female,25,0,1,230433,26,,S
444 | male,33,0,0,349257,7.8958,,S
445 | female,22,0,0,7552,10.5167,,S
446 | male,28,0,0,C.A./SOTON 34068,10.5,,S
447 | male,25,0,0,SOTON/OQ 392076,7.05,,S
448 | female,39,0,5,382652,29.125,,Q
449 | male,27,0,0,211536,13,,S
450 | female,19,0,0,112053,30,B42,S
451 | female,,1,2,W./C. 6607,23.45,,S
452 | male,26,0,0,111369,30,C148,C
453 | male,32,0,0,370376,7.75,,Q
--------------------------------------------------------------------------------
/第二章项目集合/data/train-right-up.csv:
--------------------------------------------------------------------------------
1 | Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2 | male,22,1,0,A/5 21171,7.25,,S
3 | female,38,1,0,PC 17599,71.2833,C85,C
4 | female,26,0,0,STON/O2. 3101282,7.925,,S
5 | female,35,1,0,113803,53.1,C123,S
6 | male,35,0,0,373450,8.05,,S
7 | male,,0,0,330877,8.4583,,Q
8 | male,54,0,0,17463,51.8625,E46,S
9 | male,2,3,1,349909,21.075,,S
10 | female,27,0,2,347742,11.1333,,S
11 | female,14,1,0,237736,30.0708,,C
12 | female,4,1,1,PP 9549,16.7,G6,S
13 | female,58,0,0,113783,26.55,C103,S
14 | male,20,0,0,A/5. 2151,8.05,,S
15 | male,39,1,5,347082,31.275,,S
16 | female,14,0,0,350406,7.8542,,S
17 | female,55,0,0,248706,16,,S
18 | male,2,4,1,382652,29.125,,Q
19 | male,,0,0,244373,13,,S
20 | female,31,1,0,345763,18,,S
21 | female,,0,0,2649,7.225,,C
22 | male,35,0,0,239865,26,,S
23 | male,34,0,0,248698,13,D56,S
24 | female,15,0,0,330923,8.0292,,Q
25 | male,28,0,0,113788,35.5,A6,S
26 | female,8,3,1,349909,21.075,,S
27 | female,38,1,5,347077,31.3875,,S
28 | male,,0,0,2631,7.225,,C
29 | male,19,3,2,19950,263,C23 C25 C27,S
30 | female,,0,0,330959,7.8792,,Q
31 | male,,0,0,349216,7.8958,,S
32 | male,40,0,0,PC 17601,27.7208,,C
33 | female,,1,0,PC 17569,146.5208,B78,C
34 | female,,0,0,335677,7.75,,Q
35 | male,66,0,0,C.A. 24579,10.5,,S
36 | male,28,1,0,PC 17604,82.1708,,C
37 | male,42,1,0,113789,52,,S
38 | male,,0,0,2677,7.2292,,C
39 | male,21,0,0,A./5. 2152,8.05,,S
40 | female,18,2,0,345764,18,,S
41 | female,14,1,0,2651,11.2417,,C
42 | female,40,1,0,7546,9.475,,S
43 | female,27,1,0,11668,21,,S
44 | male,,0,0,349253,7.8958,,C
45 | female,3,1,2,SC/Paris 2123,41.5792,,C
46 | female,19,0,0,330958,7.8792,,Q
47 | male,,0,0,S.C./A.4. 23567,8.05,,S
48 | male,,1,0,370371,15.5,,Q
49 | female,,0,0,14311,7.75,,Q
50 | male,,2,0,2662,21.6792,,C
51 | female,18,1,0,349237,17.8,,S
52 | male,7,4,1,3101295,39.6875,,S
53 | male,21,0,0,A/4. 39886,7.8,,S
54 | female,49,1,0,PC 17572,76.7292,D33,C
55 | female,29,1,0,2926,26,,S
56 | male,65,0,1,113509,61.9792,B30,C
57 | male,,0,0,19947,35.5,C52,S
58 | female,21,0,0,C.A. 31026,10.5,,S
59 | male,28.5,0,0,2697,7.2292,,C
60 | female,5,1,2,C.A. 34651,27.75,,S
61 | male,11,5,2,CA 2144,46.9,,S
62 | male,22,0,0,2669,7.2292,,C
63 | female,38,0,0,113572,80,B28,
64 | male,45,1,0,36973,83.475,C83,S
65 | male,4,3,2,347088,27.9,,S
66 | male,,0,0,PC 17605,27.7208,,C
67 | male,,1,1,2661,15.2458,,C
68 | female,29,0,0,C.A. 29395,10.5,F33,S
69 | male,19,0,0,S.P. 3464,8.1583,,S
70 | female,17,4,2,3101281,7.925,,S
71 | male,26,2,0,315151,8.6625,,S
72 | male,32,0,0,C.A. 33111,10.5,,S
73 | female,16,5,2,CA 2144,46.9,,S
74 | male,21,0,0,S.O.C. 14879,73.5,,S
75 | male,26,1,0,2680,14.4542,,C
76 | male,32,0,0,1601,56.4958,,S
77 | male,25,0,0,348123,7.65,F G73,S
78 | male,,0,0,349208,7.8958,,S
79 | male,,0,0,374746,8.05,,S
80 | male,0.83,0,2,248738,29,,S
81 | female,30,0,0,364516,12.475,,S
82 | male,22,0,0,345767,9,,S
83 | male,29,0,0,345779,9.5,,S
84 | female,,0,0,330932,7.7875,,Q
85 | male,28,0,0,113059,47.1,,S
86 | female,17,0,0,SO/C 14885,10.5,,S
87 | female,33,3,0,3101278,15.85,,S
88 | male,16,1,3,W./C. 6608,34.375,,S
89 | male,,0,0,SOTON/OQ 392086,8.05,,S
90 | female,23,3,2,19950,263,C23 C25 C27,S
91 | male,24,0,0,343275,8.05,,S
92 | male,29,0,0,343276,8.05,,S
93 | male,20,0,0,347466,7.8542,,S
94 | male,46,1,0,W.E.P. 5734,61.175,E31,S
95 | male,26,1,2,C.A. 2315,20.575,,S
96 | male,59,0,0,364500,7.25,,S
97 | male,,0,0,374910,8.05,,S
98 | male,71,0,0,PC 17754,34.6542,A5,C
99 | male,23,0,1,PC 17759,63.3583,D10 D12,C
100 | female,34,0,1,231919,23,,S
101 | male,34,1,0,244367,26,,S
102 | female,28,0,0,349245,7.8958,,S
103 | male,,0,0,349215,7.8958,,S
104 | male,21,0,1,35281,77.2875,D26,S
105 | male,33,0,0,7540,8.6542,,S
106 | male,37,2,0,3101276,7.925,,S
107 | male,28,0,0,349207,7.8958,,S
108 | female,21,0,0,343120,7.65,,S
109 | male,,0,0,312991,7.775,,S
110 | male,38,0,0,349249,7.8958,,S
111 | female,,1,0,371110,24.15,,Q
112 | male,47,0,0,110465,52,C110,S
113 | female,14.5,1,0,2665,14.4542,,C
114 | male,22,0,0,324669,8.05,,S
115 | female,20,1,0,4136,9.825,,S
116 | female,17,0,0,2627,14.4583,,C
117 | male,21,0,0,STON/O 2. 3101294,7.925,,S
118 | male,70.5,0,0,370369,7.75,,Q
119 | male,29,1,0,11668,21,,S
120 | male,24,0,1,PC 17558,247.5208,B58 B60,C
121 | female,2,4,2,347082,31.275,,S
122 | male,21,2,0,S.O.C. 14879,73.5,,S
123 | male,,0,0,A4. 54510,8.05,,S
124 | male,32.5,1,0,237736,30.0708,,C
125 | female,32.5,0,0,27267,13,E101,S
126 | male,54,0,1,35281,77.2875,D26,S
127 | male,12,1,0,2651,11.2417,,C
128 | male,,0,0,370372,7.75,,Q
129 | male,24,0,0,C 17369,7.1417,,S
130 | female,,1,1,2668,22.3583,F E69,C
131 | male,45,0,0,347061,6.975,,S
132 | male,33,0,0,349241,7.8958,,C
133 | male,20,0,0,SOTON/O.Q. 3101307,7.05,,S
134 | female,47,1,0,A/5. 3337,14.5,,S
135 | female,29,1,0,228414,26,,S
136 | male,25,0,0,C.A. 29178,13,,S
137 | male,23,0,0,SC/PARIS 2133,15.0458,,C
138 | female,19,0,2,11752,26.2833,D47,S
139 | male,37,1,0,113803,53.1,C123,S
140 | male,16,0,0,7534,9.2167,,S
141 | male,24,0,0,PC 17593,79.2,B86,C
142 | female,,0,2,2678,15.2458,,C
143 | female,22,0,0,347081,7.75,,S
144 | female,24,1,0,STON/O2. 3101279,15.85,,S
145 | male,19,0,0,365222,6.75,,Q
146 | male,18,0,0,231945,11.5,,S
147 | male,19,1,1,C.A. 33112,36.75,,S
148 | male,27,0,0,350043,7.7958,,S
149 | female,9,2,2,W./C. 6608,34.375,,S
150 | male,36.5,0,2,230080,26,F2,S
151 | male,42,0,0,244310,13,,S
152 | male,51,0,0,S.O.P. 1166,12.525,,S
153 | female,22,1,0,113776,66.6,C2,S
154 | male,55.5,0,0,A.5. 11206,8.05,,S
155 | male,40.5,0,2,A/5. 851,14.5,,S
156 | male,,0,0,Fa 265302,7.3125,,S
157 | male,51,0,1,PC 17597,61.3792,,C
158 | female,16,0,0,35851,7.7333,,Q
159 | male,30,0,0,SOTON/OQ 392090,8.05,,S
160 | male,,0,0,315037,8.6625,,S
161 | male,,8,2,CA. 2343,69.55,,S
162 | male,44,0,1,371362,16.1,,S
163 | female,40,0,0,C.A. 33595,15.75,,S
164 | male,26,0,0,347068,7.775,,S
165 | male,17,0,0,315093,8.6625,,S
166 | male,1,4,1,3101295,39.6875,,S
167 | male,9,0,2,363291,20.525,,S
168 | female,,0,1,113505,55,E33,S
169 | female,45,1,4,347088,27.9,,S
170 | male,,0,0,PC 17318,25.925,,S
171 | male,28,0,0,1601,56.4958,,S
172 | male,61,0,0,111240,33.5,B19,S
173 | male,4,4,1,382652,29.125,,Q
174 | female,1,1,1,347742,11.1333,,S
175 | male,21,0,0,STON/O 2. 3101280,7.925,,S
176 | male,56,0,0,17764,30.6958,A7,C
177 | male,18,1,1,350404,7.8542,,S
178 | male,,3,1,4133,25.4667,,S
179 | female,50,0,0,PC 17595,28.7125,C49,C
180 | male,30,0,0,250653,13,,S
181 | male,36,0,0,LINE,0,,S
182 | female,,8,2,CA. 2343,69.55,,S
183 | male,,0,0,SC/PARIS 2131,15.05,,C
184 | male,9,4,2,347077,31.3875,,S
185 | male,1,2,1,230136,39,F4,S
186 | female,4,0,2,315153,22.025,,S
187 | male,,0,0,113767,50,A32,S
188 | female,,1,0,370365,15.5,,Q
189 | male,45,0,0,111428,26.55,,S
190 | male,40,1,1,364849,15.5,,Q
191 | male,36,0,0,349247,7.8958,,S
192 | female,32,0,0,234604,13,,S
193 | male,19,0,0,28424,13,,S
194 | female,19,1,0,350046,7.8542,,S
195 | male,3,1,1,230080,26,F2,S
196 | female,44,0,0,PC 17610,27.7208,B4,C
197 | female,58,0,0,PC 17569,146.5208,B80,C
198 | male,,0,0,368703,7.75,,Q
199 | male,42,0,1,4579,8.4042,,S
200 | female,,0,0,370370,7.75,,Q
201 | female,24,0,0,248747,13,,S
202 | male,28,0,0,345770,9.5,,S
203 | male,,8,2,CA. 2343,69.55,,S
204 | male,34,0,0,3101264,6.4958,,S
205 | male,45.5,0,0,2628,7.225,,C
206 | male,18,0,0,A/5 3540,8.05,,S
207 | female,2,0,1,347054,10.4625,G6,S
208 | male,32,1,0,3101278,15.85,,S
209 | male,26,0,0,2699,18.7875,,C
210 | female,16,0,0,367231,7.75,,Q
211 | male,40,0,0,112277,31,A31,C
212 | male,24,0,0,SOTON/O.Q. 3101311,7.05,,S
213 | female,35,0,0,F.C.C. 13528,21,,S
214 | male,22,0,0,A/5 21174,7.25,,S
215 | male,30,0,0,250646,13,,S
216 | male,,1,0,367229,7.75,,Q
217 | female,31,1,0,35273,113.275,D36,C
218 | female,27,0,0,STON/O2. 3101283,7.925,,S
219 | male,42,1,0,243847,27,,S
220 | female,32,0,0,11813,76.2917,D15,C
221 | male,30,0,0,W/C 14208,10.5,,S
222 | male,16,0,0,SOTON/OQ 392089,8.05,,S
223 | male,27,0,0,220367,13,,S
224 | male,51,0,0,21440,8.05,,S
225 | male,,0,0,349234,7.8958,,S
226 | male,38,1,0,19943,90,C93,S
227 | male,22,0,0,PP 4348,9.35,,S
228 | male,19,0,0,SW/PP 751,10.5,,S
229 | male,20.5,0,0,A/5 21173,7.25,,S
230 | male,18,0,0,236171,13,,S
231 | female,,3,1,4133,25.4667,,S
232 | female,35,1,0,36973,83.475,C83,S
233 | male,29,0,0,347067,7.775,,S
234 | male,59,0,0,237442,13.5,,S
235 | female,5,4,2,347077,31.3875,,S
236 | male,24,0,0,C.A. 29566,10.5,,S
237 | female,,0,0,W./C. 6609,7.55,,S
238 | male,44,1,0,26707,26,,S
239 | female,8,0,2,C.A. 31921,26.25,,S
240 | male,19,0,0,28665,10.5,,S
241 | male,33,0,0,SCO/W 1585,12.275,,S
242 | female,,1,0,2665,14.4542,,C
243 | female,,1,0,367230,15.5,,Q
244 | male,29,0,0,W./C. 14263,10.5,,S
245 | male,22,0,0,STON/O 2. 3101275,7.125,,S
246 | male,30,0,0,2694,7.225,,C
247 | male,44,2,0,19928,90,C78,Q
248 | female,25,0,0,347071,7.775,,S
249 | female,24,0,2,250649,14.5,,S
250 | male,37,1,1,11751,52.5542,D35,S
251 | male,54,1,0,244252,26,,S
252 | male,,0,0,362316,7.25,,S
253 | female,29,1,1,347054,10.4625,G6,S
254 | male,62,0,0,113514,26.55,C87,S
255 | male,30,1,0,A/5. 3336,16.1,,S
256 | female,41,0,2,370129,20.2125,,S
257 | female,29,0,2,2650,15.2458,,C
258 | female,,0,0,PC 17585,79.2,,C
259 | female,30,0,0,110152,86.5,B77,S
260 | female,35,0,0,PC 17755,512.3292,,C
261 | female,50,0,1,230433,26,,S
262 | male,,0,0,384461,7.75,,Q
263 | male,3,4,2,347077,31.3875,,S
264 | male,52,1,1,110413,79.65,E67,S
265 | male,40,0,0,112059,0,B94,S
266 | female,,0,0,382649,7.75,,Q
267 | male,36,0,0,C.A. 17248,10.5,,S
268 | male,16,4,1,3101295,39.6875,,S
269 | male,25,1,0,347083,7.775,,S
270 | female,58,0,1,PC 17582,153.4625,C125,S
271 | female,35,0,0,PC 17760,135.6333,C99,S
272 | male,,0,0,113798,31,,S
273 | male,25,0,0,LINE,0,,S
274 | female,41,0,1,250644,19.5,,S
275 | male,37,0,1,PC 17596,29.7,C118,C
276 | female,,0,0,370375,7.75,,Q
277 | female,63,1,0,13502,77.9583,D7,S
278 | female,45,0,0,347073,7.75,,S
279 | male,,0,0,239853,0,,S
280 | male,7,4,1,382652,29.125,,Q
281 | female,35,1,1,C.A. 2673,20.25,,S
282 | male,65,0,0,336439,7.75,,Q
283 | male,28,0,0,347464,7.8542,,S
284 | male,16,0,0,345778,9.5,,S
285 | male,19,0,0,A/5. 10482,8.05,,S
286 | male,,0,0,113056,26,A19,S
287 | male,33,0,0,349239,8.6625,,C
288 | male,30,0,0,345774,9.5,,S
289 | male,22,0,0,349206,7.8958,,S
290 | male,42,0,0,237798,13,,S
291 | female,22,0,0,370373,7.75,,Q
292 | female,26,0,0,19877,78.85,,S
293 | female,19,1,0,11967,91.0792,B49,C
294 | male,36,0,0,SC/Paris 2163,12.875,D,C
295 | female,24,0,0,349236,8.85,,S
296 | male,24,0,0,349233,7.8958,,S
297 | male,,0,0,PC 17612,27.7208,,C
298 | male,23.5,0,0,2693,7.2292,,C
299 | female,2,1,2,113781,151.55,C22 C26,S
300 | male,,0,0,19988,30.5,C106,S
301 | female,50,0,1,PC 17558,247.5208,B58 B60,C
302 | female,,0,0,9234,7.75,,Q
303 | male,,2,0,367226,23.25,,Q
304 | male,19,0,0,LINE,0,,S
305 | female,,0,0,226593,12.35,E101,Q
306 | male,,0,0,A/5 2466,8.05,,S
307 | male,0.92,1,2,113781,151.55,C22 C26,S
308 | female,,0,0,17421,110.8833,,C
309 | female,17,1,0,PC 17758,108.9,C65,C
310 | male,30,1,0,P/PP 3381,24,,C
311 | female,30,0,0,PC 17485,56.9292,E36,C
312 | female,24,0,0,11767,83.1583,C54,C
313 | female,18,2,2,PC 17608,262.375,B57 B59 B63 B66,C
314 | female,26,1,1,250651,26,,S
315 | male,28,0,0,349243,7.8958,,S
316 | male,43,1,1,F.C.C. 13529,26.25,,S
317 | female,26,0,0,347470,7.8542,,S
318 | female,24,1,0,244367,26,,S
319 | male,54,0,0,29011,14,,S
320 | female,31,0,2,36928,164.8667,C7,S
321 | female,40,1,1,16966,134.5,E34,C
322 | male,22,0,0,A/5 21172,7.25,,S
323 | male,27,0,0,349219,7.8958,,S
324 | female,30,0,0,234818,12.35,,Q
325 | female,22,1,1,248738,29,,S
326 | male,,8,2,CA. 2343,69.55,,S
327 | female,36,0,0,PC 17760,135.6333,C32,C
328 | male,61,0,0,345364,6.2375,,S
329 | female,36,0,0,28551,13,D,S
330 | female,31,1,1,363291,20.525,,S
331 | female,16,0,1,111361,57.9792,B18,C
332 | female,,2,0,367226,23.25,,Q
333 | male,45.5,0,0,113043,28.5,C124,S
334 | male,38,0,1,PC 17582,153.4625,C91,S
335 | male,16,2,0,345764,18,,S
336 | female,,1,0,PC 17611,133.65,,S
337 | male,,0,0,349225,7.8958,,S
338 | male,29,1,0,113776,66.6,C2,S
339 | female,41,0,0,16966,134.5,E40,C
340 | male,45,0,0,7598,8.05,,S
341 | male,45,0,0,113784,35.5,T,S
342 | male,2,1,1,230080,26,F2,S
343 | female,24,3,2,19950,263,C23 C25 C27,S
344 | male,28,0,0,248740,13,,S
345 | male,25,0,0,244361,13,,S
346 | male,36,0,0,229236,13,,S
347 | female,24,0,0,248733,13,F33,S
348 | female,40,0,0,31418,13,,S
349 | female,,1,0,386525,16.1,,S
350 | male,3,1,1,C.A. 37671,15.9,,S
351 | male,42,0,0,315088,8.6625,,S
352 | male,23,0,0,7267,9.225,,S
353 | male,,0,0,113510,35,C128,S
354 | male,15,1,1,2695,7.2292,,C
355 | male,25,1,0,349237,17.8,,S
356 | male,,0,0,2647,7.225,,C
357 | male,28,0,0,345783,9.5,,S
358 | female,22,0,1,113505,55,E33,S
359 | female,38,0,0,237671,13,,S
360 | female,,0,0,330931,7.8792,,Q
361 | female,,0,0,330980,7.8792,,Q
362 | male,40,1,4,347088,27.9,,S
363 | male,29,1,0,SC/PARIS 2167,27.7208,,C
364 | female,45,0,1,2691,14.4542,,C
365 | male,35,0,0,SOTON/O.Q. 3101310,7.05,,S
366 | male,,1,0,370365,15.5,,Q
367 | male,30,0,0,C 7076,7.25,,S
368 | female,60,1,0,110813,75.25,D37,C
369 | female,,0,0,2626,7.2292,,C
370 | female,,0,0,14313,7.75,,Q
371 | female,24,0,0,PC 17477,69.3,B35,C
372 | male,25,1,0,11765,55.4417,E50,C
373 | male,18,1,0,3101267,6.4958,,S
374 | male,19,0,0,323951,8.05,,S
375 | male,22,0,0,PC 17760,135.6333,,C
376 | female,3,3,1,349909,21.075,,S
377 | female,,1,0,PC 17604,82.1708,,C
378 | female,22,0,0,C 7077,7.25,,S
379 | male,27,0,2,113503,211.5,C82,C
380 | male,20,0,0,2648,4.0125,,C
381 | male,19,0,0,347069,7.775,,S
382 | female,42,0,0,PC 17757,227.525,,C
383 | female,1,0,2,2653,15.7417,,C
384 | male,32,0,0,STON/O 2. 3101293,7.925,,S
385 | female,35,1,0,113789,52,,S
386 | male,,0,0,349227,7.8958,,S
387 | male,18,0,0,S.O.C. 14879,73.5,,S
388 | male,1,5,2,CA 2144,46.9,,S
389 | female,36,0,0,27849,13,,S
390 | male,,0,0,367655,7.7292,,Q
391 | female,17,0,0,SC 1748,12,,C
392 | male,36,1,2,113760,120,B96 B98,S
393 | male,21,0,0,350034,7.7958,,S
394 | male,28,2,0,3101277,7.925,,S
395 | female,23,1,0,35273,113.275,D36,C
396 | female,24,0,2,PP 9549,16.7,G6,S
397 | male,22,0,0,350052,7.7958,,S
398 | female,31,0,0,350407,7.8542,,S
399 | male,46,0,0,28403,26,,S
400 | male,23,0,0,244278,10.5,,S
401 | female,28,0,0,240929,12.65,,S
402 | male,39,0,0,STON/O 2. 3101289,7.925,,S
403 | male,26,0,0,341826,8.05,,S
404 | female,21,1,0,4137,9.825,,S
405 | male,28,1,0,STON/O2. 3101279,15.85,,S
406 | female,20,0,0,315096,8.6625,,S
407 | male,34,1,0,28664,21,,S
408 | male,51,0,0,347064,7.75,,S
409 | male,3,1,1,29106,18.75,,S
410 | male,21,0,0,312992,7.775,,S
411 | female,,3,1,4133,25.4667,,S
412 | male,,0,0,349222,7.8958,,S
413 | male,,0,0,394140,6.8583,,Q
414 | female,33,1,0,19928,90,C78,Q
415 | male,,0,0,239853,0,,S
416 | male,44,0,0,STON/O 2. 3101269,7.925,,S
417 | female,,0,0,343095,8.05,,S
418 | female,34,1,1,28220,32.5,,S
419 | female,18,0,2,250652,13,,S
420 | male,30,0,0,28228,13,,S
421 | female,10,0,2,345773,24.15,,S
422 | male,,0,0,349254,7.8958,,C
423 | male,21,0,0,A/5. 13032,7.7333,,Q
424 | male,29,0,0,315082,7.875,,S
425 | female,28,1,1,347080,14.4,,S
426 | male,18,1,1,370129,20.2125,,S
427 | male,,0,0,A/4. 34244,7.25,,S
428 | female,28,1,0,2003,26,,S
429 | female,19,0,0,250655,26,,S
430 | male,,0,0,364851,7.75,,Q
431 | male,32,0,0,SOTON/O.Q. 392078,8.05,E10,S
432 | male,28,0,0,110564,26.55,C52,S
433 | female,,1,0,376564,16.1,,S
434 | female,42,1,0,SC/AH 3085,26,,S
435 | male,17,0,0,STON/O 2. 3101274,7.125,,S
436 | male,50,1,0,13507,55.9,E44,S
437 | female,14,1,2,113760,120,B96 B98,S
438 | female,21,2,2,W./C. 6608,34.375,,S
439 | female,24,2,3,29106,18.75,,S
440 | male,64,1,4,19950,263,C23 C25 C27,S
--------------------------------------------------------------------------------
/第二章项目集合/sex_fare_survived.csv:
--------------------------------------------------------------------------------
1 | Sex,Fare,Survived
2 | female,44.47981783439487,233
3 | male,25.523893414211418,109
4 |
--------------------------------------------------------------------------------
/第二章项目集合/unit_result.csv:
--------------------------------------------------------------------------------
1 | ,,0
2 | 0,Unnamed: 0,0
3 | 0,PassengerId,1
4 | 0,Survived,0
5 | 0,Pclass,3
6 | 0,Name,"Braund, Mr. Owen Harris"
7 | 0,Sex,male
8 | 0,Age,22.0
9 | 0,SibSp,1.0
10 | 0,Parch,0.0
11 | 0,Ticket,A/5 21171
12 | 0,Fare,7.25
13 | 0,Embarked,S
14 | 1,Unnamed: 0,1
15 | 1,PassengerId,2
16 | 1,Survived,1
17 | 1,Pclass,1
18 | 1,Name,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
19 | 1,Sex,female
20 | 1,Age,38.0
21 | 1,SibSp,1.0
22 |
--------------------------------------------------------------------------------
/第二章项目集合/vis_data/sex_fare_survived.csv:
--------------------------------------------------------------------------------
1 | Sex,Fare,Survived
2 | female,44.47981783439487,233
3 | male,25.523893414211418,109
4 |
--------------------------------------------------------------------------------
/第二章项目集合/vis_data/unit_result.csv:
--------------------------------------------------------------------------------
1 | 0,Unnamed: 0,0
2 | 0,PassengerId,1
3 | 0,Survived,0
4 | 0,Pclass,3
5 | 0,Name,"Braund, Mr. Owen Harris"
6 | 0,Sex,male
7 | 0,Age,22.0
8 | 0,SibSp,1.0
9 | 0,Parch,0.0
10 | 0,Ticket,A/5 21171
11 | 0,Fare,7.25
12 | 0,Embarked,S
13 | 1,Unnamed: 0,1
14 | 1,PassengerId,2
15 | 1,Survived,1
16 | 1,Pclass,1
17 | 1,Name,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
18 | 1,Sex,female
19 | 1,Age,38.0
20 | 1,SibSp,1.0
21 |
--------------------------------------------------------------------------------
/第二章项目集合/第二章:第一节数据清洗及特征处理-课程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "【回顾&引言】前面一章的内容大家可以感觉到我们主要是对基础知识做一个梳理,让大家了解数据分析的一些操作,主要做了数据的各个角度的观察。那么在这里,我们主要是做数据分析的流程性学习,主要是包括了数据清洗以及数据的特征处理,数据重构以及数据可视化。这些内容是为数据分析最后的建模和模型评价做一个铺垫。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "#### 开始之前,导入numpy、pandas包和数据"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "#加载所需的库\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "#加载数据train.csv\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "## 2 第二章:数据清洗及特征处理\n",
40 | "我们拿到的数据通常是不干净的,所谓的不干净,就是数据中有缺失值,有一些异常点等,需要经过一定的处理才能继续做后面的分析或建模,所以拿到数据的第一步是进行数据清洗,本章我们将学习缺失值、重复值、字符串和数据转换等操作,将数据清洗成可以分析或建模的亚子。"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "### 2.1 缺失值观察与处理\n",
48 | "我们拿到的数据经常会有很多缺失值,比如我们可以看到Cabin列存在NaN,那其他列还有没有缺失值,这些缺失值要怎么处理呢"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "#### 2.1.1 任务一:缺失值观察\n",
56 | "(1) 请查看每个特征缺失值个数 \n",
57 | "(2) 请查看Age, Cabin, Embarked列的数据\n",
58 | "以上方式都有多种方式,所以大家多多益善"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "#写入代码\n",
68 | "\n",
69 | "\n"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "#写入代码\n",
79 | "\n",
80 | "\n"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "#写入代码\n",
90 | "\n",
91 | "\n"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "#### 2.1.2 任务二:对缺失值进行处理\n",
99 | "(1)处理缺失值一般有几种思路\n",
100 | "\n",
101 | "(2) 请尝试对Age列的数据的缺失值进行处理\n",
102 | "\n",
103 | "(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理 \n"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "#处理缺失值的一般思路:\n",
113 | "#提醒:可使用的函数有--->dropna函数与fillna函数\n",
114 | "\n",
115 | "\n"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "metadata": {},
122 | "outputs": [],
123 | "source": [
124 | "#写入代码\n",
125 | "\n",
126 | "\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {},
133 | "outputs": [],
134 | "source": [
135 | "#写入代码\n",
136 | "\n",
137 | "\n"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "#写入代码\n",
147 | "\n",
148 | "\n"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "【思考1】dropna和fillna有哪些参数,分别如何使用呢? "
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "【思考】检索空缺值用`np.nan`,`None`以及`.isnull()`哪个更好,这是为什么?如果其中某个方式无法找到缺失值,原因又是为什么?"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 1,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "#思考回答\n",
172 | "\n",
173 | "\n"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "### 2.2 重复值观察与处理\n",
195 | "由于这样那样的原因,数据中会不会存在重复值呢,如果存在要怎样处理呢"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "#### 2.2.1 任务一:请查看数据中的重复值"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {},
209 | "outputs": [],
210 | "source": [
211 | "#写入代码\n",
212 | "\n",
213 | "\n"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "#### 2.2.2 任务二:对重复值进行处理\n",
221 | "(1)重复值有哪些处理方式呢?\n",
222 | "\n",
223 | "(2)处理我们数据的重复值\n",
224 | "\n",
225 | "方法多多益善"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "#重复值有哪些处理方式:\n",
235 | "\n",
236 | "\n"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "#写入代码\n",
246 | "\n",
247 | "\n"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "#### 2.2.3 任务三:将前面清洗的数据保存为csv格式"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "#写入代码\n",
264 | "\n",
265 | "\n"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "### 2.3 特征观察与处理\n",
273 | "我们对特征进行一下观察,可以把特征大概分为两大类: \n",
274 | "数值型特征:Survived ,Pclass, Age ,SibSp, Parch, Fare,其中Survived, Pclass为离散型数值特征,Age,SibSp, Parch, Fare为连续型数值特征 \n",
275 | "文本型特征:Name, Sex, Cabin,Embarked, Ticket,其中Sex, Cabin, Embarked, Ticket为类别型文本特征,数值型特征一般可以直接用于模型的训练,但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "#### 2.3.1 任务一:对年龄进行分箱(离散化)处理\n",
283 | "(1) 分箱操作是什么?\n",
284 | "\n",
285 | "(2) 将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示 \n",
286 | "\n",
287 | "(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示 \n",
288 | "\n",
289 | "(4) 将连续变量Age按10% 30% 50% 70% 90%五个年龄段,并用分类变量12345表示\n",
290 | "\n",
291 | "(5) 将上面的获得的数据分别进行保存,保存为csv格式"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": [
300 | "#分箱操作是什么:\n",
301 | "\n",
302 | "\n"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "#写入代码\n",
312 | "\n",
313 | "\n"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "#写入代码\n",
323 | "\n",
324 | "\n"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "metadata": {},
331 | "outputs": [],
332 | "source": [
333 | "#写入代码\n",
334 | "\n",
335 | "\n"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "#### 2.3.2 任务二:对文本变量进行转换\n",
357 | "(1) 查看文本变量名及种类 \n",
358 | "(2) 将文本变量Sex, Cabin ,Embarked用数值变量12345表示 \n",
359 | "(3) 将文本变量Sex, Cabin, Embarked用one-hot编码表示"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": null,
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "#写入代码\n",
369 | "\n",
370 | "\n"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": null,
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "#写入代码\n",
380 | "\n",
381 | "\n"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {},
388 | "outputs": [],
389 | "source": [
390 | "#写入代码\n",
391 | "\n",
392 | "\n"
393 | ]
394 | },
395 | {
396 | "cell_type": "markdown",
397 | "metadata": {},
398 | "source": [
399 | "#### 2.3.3 任务三:从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)"
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": null,
405 | "metadata": {},
406 | "outputs": [],
407 | "source": [
408 | "#写入代码\n",
409 | "\n",
410 | "\n"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": null,
416 | "metadata": {},
417 | "outputs": [],
418 | "source": [
419 | "#保存最终你完成的已经清理好的数据\n"
420 | ]
421 | }
422 | ],
423 | "metadata": {
424 | "kernelspec": {
425 | "display_name": "Python 3",
426 | "language": "python",
427 | "name": "python3"
428 | },
429 | "language_info": {
430 | "codemirror_mode": {
431 | "name": "ipython",
432 | "version": 3
433 | },
434 | "file_extension": ".py",
435 | "mimetype": "text/x-python",
436 | "name": "python",
437 | "nbconvert_exporter": "python",
438 | "pygments_lexer": "ipython3",
439 | "version": "3.6.9"
440 | },
441 | "toc": {
442 | "base_numbering": 1,
443 | "nav_menu": {},
444 | "number_sections": false,
445 | "sideBar": true,
446 | "skip_h1_title": false,
447 | "title_cell": "Table of Contents",
448 | "title_sidebar": "Contents",
449 | "toc_cell": false,
450 | "toc_position": {
451 | "height": "calc(100% - 180px)",
452 | "left": "10px",
453 | "top": "150px",
454 | "width": "433px"
455 | },
456 | "toc_section_display": true,
457 | "toc_window_display": true
458 | }
459 | },
460 | "nbformat": 4,
461 | "nbformat_minor": 4
462 | }
463 |
--------------------------------------------------------------------------------
/第二章项目集合/第二章:第三节数据重构2-课程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了**数据的清洗**,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "#### 开始之前,导入numpy、pandas包和数据"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 3,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# 导入基本库\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# 载入上一个任务人保存的文件中:result.csv,并查看这个文件\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "# 2 第二章:数据重构\n"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## 第一部分:数据聚合与运算"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "### 2.6 数据运用"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "#### 2.6.1 任务一:通过教材《Python for Data Analysis》P303、Google or anything来学习了解GroupBy机制"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 5,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "#写入心得\n"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "#### 2.4.2:任务二:计算泰坦尼克号男性与女性的平均票价"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 2,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# 写入代码\n"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "在了解GroupBy机制之后,运用这个机制完成一系列的操作,来达到我们的目的。\n",
93 | "\n",
94 | "下面通过几个任务来熟悉GroupBy机制。"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "#### 2.4.3:任务三:统计泰坦尼克号中男女的存活人数"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 2,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "# 写入代码\n"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "#### 2.4.4:任务四:计算客舱不同等级的存活人数"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 2,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "# 写入代码\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "【**提示:**】表中的存活那一栏,可以发现如果还活着记为1,死亡记为0"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "【**思考**】从数据分析的角度,上面的统计结果可以得出那些结论"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 9,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "#思考心得 \n",
150 | "\n"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "【思考】从任务二到任务三中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 1,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "#思考心得\n",
167 | "\n",
168 | "\n"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "#### 2.4.5:任务五:统计在不同等级的票中的不同年龄的船票花费的平均值"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 2,
181 | "metadata": {},
182 | "outputs": [],
183 | "source": [
184 | "# 写入代码\n"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "#### 2.4.6:任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 2,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "# 写入代码\n"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "#### 2.4.7:任务七:得出不同年龄的总的存活人数,然后找出存活人数最多的年龄段,最后计算存活人数最高的存活率(存活人数/总人数)\n"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 2,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "# 写入代码\n"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": 2,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "# 写入代码\n"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 2,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "# 写入代码\n"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 2,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "# 写入代码\n"
244 | ]
245 | }
246 | ],
247 | "metadata": {
248 | "kernelspec": {
249 | "display_name": "Python 3",
250 | "language": "python",
251 | "name": "python3"
252 | },
253 | "language_info": {
254 | "codemirror_mode": {
255 | "name": "ipython",
256 | "version": 3
257 | },
258 | "file_extension": ".py",
259 | "mimetype": "text/x-python",
260 | "name": "python",
261 | "nbconvert_exporter": "python",
262 | "pygments_lexer": "ipython3",
263 | "version": "3.6.12"
264 | },
265 | "toc": {
266 | "base_numbering": 1,
267 | "nav_menu": {},
268 | "number_sections": false,
269 | "sideBar": true,
270 | "skip_h1_title": false,
271 | "title_cell": "Table of Contents",
272 | "title_sidebar": "Contents",
273 | "toc_cell": false,
274 | "toc_position": {
275 | "height": "calc(100% - 180px)",
276 | "left": "10px",
277 | "top": "150px",
278 | "width": "582px"
279 | },
280 | "toc_section_display": true,
281 | "toc_window_display": true
282 | }
283 | },
284 | "nbformat": 4,
285 | "nbformat_minor": 4
286 | }
287 |
--------------------------------------------------------------------------------
/第二章项目集合/第二章:第二节数据重构1-课程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了**数据的清洗**,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "#### 开始之前,导入numpy、pandas包和数据"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 2,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# 导入基本库\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# 载入data文件中的:train-left-up.csv\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "# 2 第二章:数据重构\n"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "### 2.4 数据的合并"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "#### 2.4.1 任务一:将data文件夹里面的所有数据都载入,观察数据的之间的关系"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "#写入代码\n",
63 | "\n"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "#写入代码\n",
73 | "\n"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "【提示】结合之前我们加载的train.csv数据,大致预测一下上面的数据是什么"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "#### 2.4.2:任务二:使用concat方法:将数据train-left-up.csv和train-right-up.csv横向合并为一张表,并保存这张表为result_up"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "#写入代码\n",
97 | "\n"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "#### 2.4.3 任务三:使用concat方法:将train-left-down和train-right-down横向合并为一张表,并保存这张表为result_down。然后将上边的result_up和result_down纵向合并为result。"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "#写入代码\n",
114 | "\n"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "#### 2.4.4 任务四:使用DataFrame自带的方法join方法和append:完成任务二和任务三的任务"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "#写入代码\n",
131 | "\n"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "#### 2.4.5 任务五:使用Panads的merge方法和DataFrame的append方法:完成任务二和任务三的任务"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "#写入代码\n",
148 | "\n"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "【思考】对比merge、join以及concat的方法的不同以及相同。思考一下在任务四和任务五的情况下,为什么都要求使用DataFrame的append方法,如何只要求使用merge或者join可不可以完成任务四和任务五呢?"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "#### 2.4.6 任务六:完成的数据保存为result.csv"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "#写入代码\n",
172 | "\n"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "### 2.5 换一种角度看数据"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {},
185 | "source": [
186 | "#### 2.5.1 任务一:将我们的数据变为Series类型的数据"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "#写入代码\n",
196 | "\n"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "#写入代码\n",
206 | "\n"
207 | ]
208 | }
209 | ],
210 | "metadata": {
211 | "kernelspec": {
212 | "display_name": "Python 3",
213 | "language": "python",
214 | "name": "python3"
215 | },
216 | "language_info": {
217 | "codemirror_mode": {
218 | "name": "ipython",
219 | "version": 3
220 | },
221 | "file_extension": ".py",
222 | "mimetype": "text/x-python",
223 | "name": "python",
224 | "nbconvert_exporter": "python",
225 | "pygments_lexer": "ipython3",
226 | "version": "3.7.3"
227 | },
228 | "toc": {
229 | "base_numbering": 1,
230 | "nav_menu": {},
231 | "number_sections": false,
232 | "sideBar": true,
233 | "skip_h1_title": false,
234 | "title_cell": "Table of Contents",
235 | "title_sidebar": "Contents",
236 | "toc_cell": false,
237 | "toc_position": {
238 | "height": "calc(100% - 180px)",
239 | "left": "10px",
240 | "top": "150px",
241 | "width": "582px"
242 | },
243 | "toc_section_display": true,
244 | "toc_window_display": true
245 | }
246 | },
247 | "nbformat": 4,
248 | "nbformat_minor": 2
249 | }
250 |
--------------------------------------------------------------------------------
/第二章项目集合/第二章:第四节数据可视化-课程.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**复习:**回顾学习完第一章,我们对泰坦尼克号数据有了基本的了解,也学到了一些基本的统计方法,第二章中我们学习了数据的清理和重构,使得数据更加的易于理解;今天我们要学习的是第二章第三节:**数据可视化**,主要给大家介绍一下Python数据可视化库Matplotlib,在本章学习中,你也许会觉得数据很有趣。在打比赛的过程中,数据可视化可以让我们更好的看到每一个关键步骤的结果如何,可以用来优化方案,是一个很有用的技巧。"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# 2 第二章:数据可视化"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "#### 开始之前,导入numpy、pandas以及matplotlib包和数据"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "# 加载所需的库\n",
31 | "# 如果出现 ModuleNotFoundError: No module named 'xxxx'\n",
32 | "# 你只需要在终端/cmd下 pip install xxxx 即可\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "#加载result.csv这个数据\n"
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "### 2.7 如何让人一眼看懂你的数据?\n",
49 | "《Python for Data Analysis》第九章"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "#### 2.7.1 任务一:跟着书本第九章,了解matplotlib,自己创建一个数据项,对其进行基本可视化"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "【思考】最基本的可视化图案有哪些?分别适用于那些场景?(比如折线图适合可视化某个属性值随时间变化的走势)"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "#思考回答\n",
73 | "#这一部分需要了解可视化图案的的逻辑,知道什么样的图案可以表达什么样的信号b\n"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "#### 2.7.2 任务二:可视化展示泰坦尼克号数据集中男女中生存人数分布情况(用柱状图试试)。"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 4,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "#代码编写\n",
90 | "\n"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "【思考】计算出泰坦尼克号数据集中男女中死亡人数,并可视化展示?如何和男女生存人数可视化柱状图结合到一起?看到你的数据可视化,说说你的第一感受(比如:你一眼看出男生存活人数更多,那么性别可能会影响存活率)。"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "#思考题回答\n",
107 | "\n"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "#### 2.7.3 任务三:可视化展示泰坦尼克号数据集中男女中生存人与死亡人数的比例图(用柱状图试试)。"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 7,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "#代码编写\n",
124 | "# 提示:计算男女中死亡人数 1表示生存,0表示死亡\n",
125 | "\n",
126 | "\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "【提示】男女这两个数据轴,存活和死亡人数按比例用柱状图表示"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "#### 2.7.4 任务四:可视化展示泰坦尼克号数据集中不同票价的人生存和死亡人数分布情况。(用折线图试试)(横轴是不同票价,纵轴是存活人数)"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "【提示】对于这种统计性质的且用折线表示的数据,你可以考虑将数据排序或者不排序来分别表示。看看你能发现什么?"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 4,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "#代码编写\n",
157 | "# 计算不同票价中生存与死亡人数 1表示生存,0表示死亡\n",
158 | "\n",
159 | "\n"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {
165 | "scrolled": true
166 | },
167 | "source": [
168 | "#### 2.7.5 任务五:可视化展示泰坦尼克号数据集中不同仓位等级的人生存和死亡人员的分布情况。(用柱状图试试)"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 8,
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "#代码编写\n",
178 | "# 1表示生存,0表示死亡\n",
179 | "\n",
180 | "\n"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "【思考】看到这个前面几个数据可视化,说说你的第一感受和你的总结"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": null,
193 | "metadata": {},
194 | "outputs": [],
195 | "source": [
196 | "#思考题回答\n",
197 | "\n"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "#### 2.7.6 任务六:可视化展示泰坦尼克号数据集中不同年龄的人生存与死亡人数分布情况。(不限表达方式)"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 4,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "#代码编写\n",
214 | "\n"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "#### 2.7.7 任务七:可视化展示泰坦尼克号数据集中不同仓位等级的人年龄分布情况。(用折线图试试)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 4,
227 | "metadata": {},
228 | "outputs": [],
229 | "source": [
230 | "#代码编写\n",
231 | "\n"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "【思考】上面所有可视化的例子做一个总体的分析,你看看你能不能有自己发现"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": null,
244 | "metadata": {},
245 | "outputs": [],
246 | "source": [
247 | "#思考题回答\n",
248 | "\n"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "【总结】到这里,我们的可视化就告一段落啦,如果你对数据可视化极其感兴趣,你还可以了解一下其他可视化模块,如:pyecharts,bokeh等。\n",
256 | "\n",
257 | "如果你在工作中使用数据可视化,你必须知道数据可视化最大的作用不是炫酷,而是最快最直观的理解数据要表达什么,你觉得呢?"
258 | ]
259 | }
260 | ],
261 | "metadata": {
262 | "kernelspec": {
263 | "display_name": "Python 3",
264 | "language": "python",
265 | "name": "python3"
266 | },
267 | "language_info": {
268 | "codemirror_mode": {
269 | "name": "ipython",
270 | "version": 3
271 | },
272 | "file_extension": ".py",
273 | "mimetype": "text/x-python",
274 | "name": "python",
275 | "nbconvert_exporter": "python",
276 | "pygments_lexer": "ipython3",
277 | "version": "3.7.3"
278 | }
279 | },
280 | "nbformat": 4,
281 | "nbformat_minor": 2
282 | }
283 |
--------------------------------------------------------------------------------