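『 Spark 』5. Spark learning resources you can't afford to miss
Original link: https://litaotao.github.io/spark-resouces-blogs-paper
Note: The following is a cached copy of the original article, made to speed up access. It has been reformatted, so there may be formatting problems or occasional missing information; treat the original as authoritative.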
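Preface
This series combines my own notes from learning Spark, my reading of the reference articles, and lessons from using Spark in practice. I write it only to organize my personal study notes, so clarity comes first: details that are not needed are left out, and passages from the original English documentation are sometimes left untranslated where they do not hinder understanding. To go deeper, read the reference articles and the official documentation.
Also, this series starts from Spark 1.6.0, the latest release at the time of writing. Spark is evolving quickly, so it is worth recording the version number.
Finally, if you find any mistakes, please leave a comment. Every comment will get a reply within 24 hours. Thank you.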
1. Books and online documentation
- Learning Spark
- Advanced Analytics with Spark
- Mastering Apache Spark
- Official Guide
- Spark Guide by Cloudera
2. Websites
- official site
- user mailing list
- spark channel on youtube
- spark summit
- spark technology center
- sparkhub
- meetup
- spark third party packages
- databricks blog
- http://blog.madhukaraphatak.com/
- databricks docs
- databricks training
- cloudera blog about spark
- https://0x0fff.com
- http://techsuppdiva.github.io/
- CSDN Spark knowledge base
- 过往记忆 (a Chinese-language blog)
3. Databricks Blog
- Apache Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs
- Achieving End-to-end Security for Apache Spark with Databricks
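The two posts above are Databricks' descriptions of its professional edition. They do not solve the problem at its root, but they read persuasively because they lay out many very detailed measures. Well done. If you are building a cloud product, they are worth consulting when you present your own security approach.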
- Deep Dive into Spark SQL’s Catalyst Optimizer
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets
- Understanding your Apache Spark application through visualization
- New Visualizations for Understanding Apache Spark Streaming Applications
- An Introduction to Writing Apache Spark Applications on Databricks
- A Gentle Introduction to Apache Spark on Databricks
- Apache Spark on Databricks for Data Scientists
- Apache Spark on Databricks for Data Engineers
- Structured Streaming In Apache Spark: A new high-level API for streaming
- Spark SQL Supported Syntax
- Combining Machine Learning Frameworks with Apache Spark
- Deep Dive: Memory Management in Apache Spark
- https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html
- https://spark-summit.org/east-2017/events/improving-python-and-spark-performance-and-interoperability/
- http://www.slideshare.net/wesm/high-performance-python-on-apache-spark
- https://www.slideshare.net/SparkSummit/spark-and-online-analytics-spark-summit-east-talky-by-shubham-chopra
- https://www.slideshare.net/SparkSummit/spark-parquet-in-depth-spark-summit-east-talk-by-emily-curtin-and-robbie-strickland
- https://www.slideshare.net/databricks/sparksql-a-compiler-from-queries-to-rdds
4. Articles and blogs
- RDD paper (English version)
- RDD paper (Chinese version)
- An Architecture for Fast and General Data Processing on Large Clusters
- How-to: Tune Your Apache Spark Jobs (Part 1)
- How-to: Tune Your Apache Spark Jobs (Part 2)
- Spark Misconceptions (by 0x0fff)
- Spark Architecture (by 0x0fff)
- Spark DataFrames are faster, aren’t they? (by 0x0fff)
- Spark Architecture: Shuffle (by 0x0fff)
- Modern Data Architecture (by 0x0fff)
- Spark Architecture Talk (by 0x0fff)
- Apache Spark Future (by 0x0fff)
- Data Industry Trends (by 0x0fff)
- Spark Memory Management (by 0x0fff)
- Spark Architecture Video (by 0x0fff)
- Using Redis to make Spark 45 times faster!
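- 量化派's big data risk-control architecture based on Hadoop, Spark, and Storm
- A Spark-based heterogeneous distributed deep learning platform
- How much do you know about the Hadoop and Spark ecosystems?
- Hadoop vs Spark
- Yahoo open-sources CaffeOnSpark: distributed deep learning on Hadoop/Spark
- Second Shanghai Spark meetup of 2016: 1. spark_meetup.pdf
- Second Shanghai Spark meetup of 2016: 2. Flink_ An unified stream engine.pdf
- Second Shanghai Spark meetup of 2016: 3. Spark在计算广告领域的应用实践.pdf (Spark in computational advertising: applications and practice)
- Second Shanghai Spark meetup of 2016: 4. splunk_spark.pdf
- Spark-based big data for healthcare and finance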
- Monitoring Spark with Graphite and Grafana
- Using CombineTextInputFormat in Spark to mitigate the excessive task count caused by too many small files
- Databricks Empowers Enterprises to Secure Their Apache Spark Workloads
- Spark configuration parameters
- Running Spark Python Applications
- How-to: Prepare Your Apache Hadoop Cluster for PySpark Jobs
- Apache Spark’s Hidden REST API
- [SQL, sqlContext, hiveContext]
- [Spark Memory Issues]
- http://www.cnblogs.com/wrencai/p/4231934.html
- http://stackoverflow.com/questions/32349611/what-should-be-the-optimal-value-for-spark-sql-shuffle-partitions-or-how-do-we-i
- http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space
- A Beginner's Guide on Troubleshooting Spark Applications
- http://www.mincoder.com/article/2381.shtml
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
- http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
- https://spark.apache.org/docs/latest/job-scheduling.html
- https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
- http://blog.csdn.net/u012684933/article/details/50378725
5. Videos
- YouTube: what is apache spark
- A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks) (video and slides)
- Building, Debugging, and Tuning Spark Machine Learning Pipelines - Joseph Bradley (Databricks) (video and slides)
- Spark DataFrames Simple and Fast Analysis of Structured Data - Michael Armbrust (Databricks) (video and slides)
- Structuring Spark: DataFrames, Datasets, and Streaming (slides)
- Spark in Production: Lessons from 100+ Production Users (slides)
- Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San Jose 2015 (video and slides)
- Building a REST Job Server for Interactive Spark as a Service (video and slides)
- Easy JSON Data Manipulation in Spark - Yin Huai (Databricks) (video and slides)
- Sparkling: Speculative Partition of Data for Spark Applications - Peilong Li (video and slides)
- Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture (video and slides)
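I have always admired the videos and slides that Databricks produces; they are very clearly structured. This talk is one of the best, and there is a lot worth borrowing from it, especially when you need to introduce your own work or product to others. [My impression is that few people can describe the product they are working on clearly and in an organized way. For niche products, even some professional salespeople struggle to give a clear, concise account. This video and its slides are a valuable reference. I find that studying videos and slides like these closely is sometimes more useful than reading one or two books on sales.]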
- Getting The Best Performance With PySpark
- 700 Queries Per Second with Updates: Spark As A Real-Time Web Service (video and slides)
- Understanding Memory Management In Spark For Fun And Profit (slides)
- Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14) (slides)
- Data Storage Tips for Optimal Spark Performance - Vida Ha (Databricks) (video and slides)
- Sparkling Pandas - using Apache Spark to scale Pandas - Holden Karau and Juliet Hougland
6. next
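I will keep updating the resources above. I have personally read or watched more than 80% of them and found them worthwhile; this is not an indiscriminate collection. Recommended.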
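Articles in this series
- 『 Spark 』1. Introduction to Spark
- 『 Spark 』2. Spark basic concepts explained
- 『 Spark 』3. Spark programming model
- 『 Spark 』4. Spark RDDs
- 『 Spark 』5. Spark learning resources you can't afford to miss
- 『 Spark 』6. A deep dive into how Spark runs: job, stage, task
- 『 Spark 』7. Big data analysis with Spark DataFrame
- 『 Spark 』8. Case study | Spark in finance | Intraday trend prediction
- 『 Spark 』9. Setting up an IPython + Notebook + Spark development environment
- 『 Spark 』10. Spark application performance tuning | 12 optimization methods
- 『 Spark 』11. Spark machine learning
- 『 Spark 』12. Introduction to Spark 2.0 features
- 『 Spark 』13. Spark 2.0 Release Notes (Chinese version)
- 『 Spark 』14. A Spark SQL performance tuning journey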