{"id":287,"date":"2021-09-02T09:52:38","date_gmt":"2021-09-02T07:52:38","guid":{"rendered":"http:\/\/baptiste-meynier.com\/?p=287"},"modified":"2023-02-18T19:09:31","modified_gmt":"2023-02-18T18:09:31","slug":"spark","status":"publish","type":"post","link":"https:\/\/baptiste-meynier.com\/index.php\/2021\/09\/02\/spark\/","title":{"rendered":"Apache Spark"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">A bit of history<\/h2>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"435\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/bigdata_timeline-2-1024x435.png\" alt=\"\" class=\"wp-image-791\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/bigdata_timeline-2-1024x435.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/bigdata_timeline-2-300x127.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/bigdata_timeline-2-768x326.png 768w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/bigdata_timeline-2-1536x652.png 1536w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/bigdata_timeline-2.png 2047w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Infrastructure<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Big data = big disk?<\/h3>\n\n\n\n<p><strong>HDD<\/strong>: an enclosure containing a stack of platters coated with a ferromagnetic layer. The direction of magnetization encodes the individual bits. Data is read and written by a head (much like the stylus of a turntable on a vinyl record) that moves extremely quickly from one area of the disk to another. 
Because all of these parts are mechanical, the hard disk is the slowest and most fragile component of a computer.<\/p>\n\n\n\n<p><strong>SSD<\/strong>: this type of drive stores information in flash memory made up of individual memory cells holding bits, which the controller can access almost instantly.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is a disk controller?<\/h3>\n\n\n\n<p>It is the electronics that drive the mechanics of a hard disk. Its role is to control the spindle motors, position the read\/write heads, and interpret the electrical signals received from those heads to convert them into usable data, or to write data at a given location on the surface of the platters that make up the hard disk.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Evolution of storage over time:<\/h4>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<figure class=\"wp-block-table is-style-regular\"><table><thead><tr><th><\/th><th><\/th><th><\/th><th><\/th><th>HDD<\/th><th><\/th><th><\/th><th><\/th><th>SSD<\/th><th><\/th><\/tr><\/thead><tbody><tr><td><strong>Years<\/strong><\/td><td><strong>1956<\/strong><\/td><td><strong>1970<\/strong><\/td><td><strong>1980<\/strong><\/td><td><strong>1990<\/strong><\/td><td><strong>2000<\/strong><\/td><td><strong>2010<\/strong><\/td><td><strong>2020<\/strong><\/td><td><\/td><td><\/td><\/tr><tr><td><strong>Form factor<\/strong><\/td><td>50 platters of 
24&#8243;<\/td><td><\/td><td>5.25&#8243;<\/td><td>3.5&#8243;<\/td><td>3.5&#8243;<\/td><td>2.5&#8243;<\/td><td>2.5&#8243;<\/td><td>2.5&#8243;<\/td><td><\/td><\/tr><tr><td><strong>Max capacity<\/strong><\/td><td>5 MB<\/td><td>60 MB<\/td><td>446 MB<\/td><td>1,000 MB<\/td><td>18,000 MB<\/td><td>4,000,000 MB<\/td><td>20,000,000 MB<\/td><td>4,000,000 MB<\/td><td><\/td><\/tr><tr><td><strong>Theoretical throughput<\/strong><\/td><td>8,800 characters\/s<\/td><td><\/td><td>625 KB\/s<\/td><td>0.7 MB\/s<\/td><td>9.5 MB\/s<\/td><td>110 MB\/s<\/td><td>200 MB\/s<\/td><td>500 MB\/s<\/td><td><\/td><\/tr><tr><td><strong>Price per unit<\/strong><\/td><td>1 MB: $10,000<\/td><td><\/td><td>1 MB: $193<\/td><td>1 MB: $9<\/td><td>1 GB: 2 \u20ac<\/td><td>1 GB: 0.10 \u20ac<\/td><td>1 GB: 0.02 \u20ac<\/td><td>1 GB: 0.15 \u20ac<\/td><td><\/td><\/tr><tr><td><strong>Max capacity \/ throughput ratio (s)<\/strong><\/td><td><\/td><td><\/td><td>713.6<\/td><td>1428<\/td><td>1894<\/td><td>36363<\/td><td>100000<\/td><td>8000<\/td><td><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/storage_trends.png\" alt=\"\" class=\"wp-image-1128\" width=\"-338\" height=\"-266\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/storage_trends.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/storage_trends-300x236.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/storage_trends-768x603.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p>For file-based storage, it is better to favour small hard disks, more precisely those with the best capacity\/throughput ratio (the lower the ratio, the faster the whole disk can be read).<\/p>\n\n\n\n<p><em>Note: on Linux, you can find out the type of disk in your machine with the command &#8220;lsblk -d -o 
name,rota&#8221; (HDD rota=1, SSD rota=0)<\/em><\/p>\n<\/div>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">A few storage examples:<\/h3>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:16% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"340\" height=\"216\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/hdfs.png\" alt=\"\" class=\"wp-image-698 size-full\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/hdfs.png 340w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/hdfs-300x191.png 300w\" sizes=\"auto, (max-width: 340px) 100vw, 340px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-medium-font-size\"><strong>HDFS<\/strong> stores raw files, or structured files via the Parquet format.<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:15% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"220\" height=\"300\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/amazon_S3-220x300.png\" alt=\"\" class=\"wp-image-942 size-medium\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/amazon_S3-220x300.png 220w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/amazon_S3.png 256w\" sizes=\"auto, (max-width: 220px) 100vw, 220px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-medium-font-size\"><strong>Amazon S3<\/strong> is a file hosting service. 
Its API is accessible via REST, SOAP or BitTorrent.<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:15% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"279\" height=\"187\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/Apache-Cassandra-1.svg\" alt=\"\" class=\"wp-image-697 size-full\"\/><\/figure><div class=\"wp-block-media-text__content\">\n<p class=\"has-medium-font-size\"><strong>Apache Cassandra<\/strong> is the Rolls-Royce of (NoSQL) database management systems. It is designed to handle massive amounts of data across a large number of servers, providing high availability by eliminating single points of failure. It can spread data across several data centers, with masterless asynchronous replication and low latency for all clients&#8217; operations.<\/p>\n<\/div><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Key takeaways for storage:<\/h3>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Pay attention to your disk controller, and do not bother buying very large disks: what matters most is the capacity\/throughput ratio.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">And what about the network?<\/h3>\n\n\n\n<p>The network card is the interface between a machine and the other machines connected to the same network. 
Throughput is usually expressed in Mbit\/s (megabits per second: millions of bits per second).<\/p>\n\n\n\n<p>Common speeds of the Ethernet standard are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>10 Mbit\/s;<\/li>\n\n\n\n<li>100 Mbit\/s (Fast Ethernet);<\/li>\n\n\n\n<li>1,000 Mbit\/s, also written 1 Gbit\/s (Gigabit Ethernet);<\/li>\n\n\n\n<li>10,000 Mbit\/s (10 Gigabit Ethernet).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Is there anything to consider between network and disks?<\/h3>\n\n\n\n<p>Keep in mind that the network controller is a constraint.<\/p>\n\n\n\n<p>Suppose we have 1 Gbit\/s Ethernet in full duplex (simultaneous sending and receiving).<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1023\" height=\"493\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/saturation-1.png\" alt=\"\" class=\"wp-image-959\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/saturation-1.png 1023w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/saturation-1-300x145.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/saturation-1-768x370.png 768w\" sizes=\"auto, (max-width: 1023px) 100vw, 1023px\" \/><\/figure>\n\n\n\n<p><em>Note: Spark recommends a 10 Gigabit Ethernet network or faster<\/em><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What if we put our nodes on one big cluster with VMs or containers?<\/h3>\n\n\n\n<p>Virtualization, like containerization, is an abstraction of the hardware. By choosing this approach we lose control of our infrastructure. 
Moreover, we are still subject to the same bottleneck at the network controller.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"529\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/container.png\" alt=\"\" class=\"wp-image-685\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/container.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/container-300x155.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/container-768x397.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><em>Note: When an application hosted on a VM needs to access hardware, the guest OS performs a number of translations with its host OS. This adds latency and CPU work.<\/em><\/p>\n\n\n\n<p>The idea is to have an architecture made of small, so-called &#8220;disposable&#8221; machines. 
Never buy big machines!<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"477\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/infrastructure_architecture-1024x477.png\" alt=\"\" class=\"wp-image-924\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/infrastructure_architecture-1024x477.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/infrastructure_architecture-300x140.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/infrastructure_architecture-768x358.png 768w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/infrastructure_architecture-1536x716.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><em>Note: It is recommended to disable swap<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">But how do we access the data?<\/h2>\n\n\n\n<p>Several kinds of tools can be used to work with Big Data.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ETL (Apache Pig, Talend, &#8230;)<\/li>\n\n\n\n<li>Frameworks imposing the Actor pattern (<a rel=\"noreferrer noopener\" href=\"http:\/\/baptiste-meynier.com\/index.php\/2021\/02\/17\/vert-x\/\" target=\"_blank\">VertX<\/a>, Spring Reactor, Akka)<\/li>\n\n\n\n<li>Streaming (Kafka, Flink)<\/li>\n\n\n\n<li>And the one we are interested in, <strong>Spark<\/strong>!<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"279\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/bigdata_basic_concept-1.png\" alt=\"\" class=\"wp-image-711\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/bigdata_basic_concept-1.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/bigdata_basic_concept-1-300x82.png 
300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/bigdata_basic_concept-1-768x209.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p><em>Note: It is important to understand that storage and compute go hand in hand, like Bonnie and Clyde or Tintin and Snowy&#8230; One does not go without the other!<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Spark&#8217;s architecture relies on distributed computing<\/h2>\n\n\n\n<p>The idea behind Spark is to spread the compute and memory load across all the machines. Beware, not every operation can be decomposed this way. To be distributed, a reduction operation must be <strong>commutative<\/strong> (x*y = y*x) and <strong>associative<\/strong> ((x*y)*z = x*(y*z)).<\/p>\n\n\n\n<p>To illustrate, take the addition 351 + 128 and compute it place by place (hundreds, tens, units).<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/compute_distribution_concept.png\" alt=\"\" class=\"wp-image-727\" width=\"859\" height=\"473\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/compute_distribution_concept.png 1025w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/compute_distribution_concept-300x165.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/compute_distribution_concept-768x423.png 768w\" sizes=\"auto, (max-width: 859px) 100vw, 859px\" \/><\/figure>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\">A report generation that is too greedy:<\/h3>\n\n\n\n<p>I work at a company that generates reports requiring a memory footprint of 64 GB. 
My infrastructure consists of a rack with 128 GB of RAM on which I run 2 instances (2 * 64). All goes well until the day the business asks for more data to be generated. What should I do? Buy a new, more powerful rack?<\/p>\n\n\n\n<p><strong>The parallelism option:<\/strong><\/p>\n\n\n\n<p>I decide instead to buy 10 machines with 16 GB of RAM each. This lets us generate reports with a combined footprint of about 160 GB, rather than having a single big machine with 128 GB! Spark distributes both the computation and the memory footprint!<\/p>\n\n\n\n<p><em>Note: Keep in mind that vertical scaling is rarely the solution; on the contrary, in the vast majority of cases it will cause you problems!<\/em><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>This time the hardware constraints are different: it is RAM that should be favoured.<\/p>\n<\/blockquote>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A few architecture examples:<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>The most common:<\/strong> 1 Spark cluster + 1 storage cluster<\/li>\n\n\n\n<li><strong>Segregated:<\/strong> 1 Spark cluster for ingestion + 1 Spark cluster for reporting + 1 storage cluster<\/li>\n\n\n\n<li><strong>The most performant:<\/strong> 1 cluster hosting both Spark and storage (Cassandra). It requires nodes with a lot of RAM and very good disks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Key takeaways for compute:<\/h3>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>As with storage, good network equipment is required. 
This part also needs RAM, not necessarily a lot per machine, but spread across many machines.<\/p>\n<\/blockquote>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Components<\/h2>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"447\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/librairies-1-1024x447.png\" alt=\"\" class=\"wp-image-657\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/librairies-1-1024x447.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/librairies-1-300x131.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/librairies-1-768x335.png 768w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/librairies-1-1536x671.png 1536w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/librairies-1-2048x895.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What if we followed the life cycle of a Spark query?<\/h2>\n\n\n\n<p>To better understand Spark&#8217;s architecture, we will follow the execution of a piece of code that ranks, in descending order, the most frequently used words in a novel by Emile Zola:<\/p>\n\n\n\n<pre title=\"words-most-used\" class=\"wp-block-code\"><code lang=\"java\" class=\"language-java line-numbers\">sc.textFile(\"\/home\/baptiste\/Workspace\/ventreDeParis.txt\")\n.flatMap(text =&gt; text.split(\" \"))\n.filter(word =&gt; word.length !=0)\n.map(word =&gt; word.replaceAll(\",\",\"\"))\n.map({ word =&gt; (word, 1) })\n.reduceByKey( _ + _ )\n.sortBy(t =&gt; t._2,false)\n.collect()<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p>Spark uses a <strong>leader<\/strong>\/<strong>follower<\/strong> architecture. 
A query is submitted from the master, for example through the Spark Shell (1).<\/p>\n\n\n\n<p>Once submitted, the code is sent to the Driver (2). This is a standalone Java application whose purpose is to organize the work of the followers over the network. It is, in a way, the brain of our architecture.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"566\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/architecture_overview-1024x566.png\" alt=\"\" class=\"wp-image-1132\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/architecture_overview-1024x566.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/architecture_overview-300x166.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/architecture_overview-768x424.png 768w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/architecture_overview-1536x848.png 1536w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/architecture_overview.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><em>Note: the master, the centrepiece of the Spark architecture, is also its SPOF (single point of failure).<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Spark Context<\/h3>\n\n\n\n<p>It is Spark&#8217;s entry point; it lets you:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Connect to the Spark execution environment<\/li>\n\n\n\n<li>Retrieve the status of the Spark application<\/li>\n\n\n\n<li>Cancel a Job\/Stage<\/li>\n\n\n\n<li>Run a Job synchronously or asynchronously<\/li>\n\n\n\n<li>Allocate resources programmatically<\/li>\n\n\n\n<li>Break the code down into smaller processing units<\/li>\n\n\n\n<li>Manage 
RDDs<\/li>\n\n\n\n<li>&#8230;<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">What is an RDD?<\/h4>\n\n\n\n<p>It is the basic object you will use all the time! It stands for <strong>Resilient Distributed Dataset<\/strong>: a fault-tolerant collection of data distributed across the execution nodes. While its mechanics are complex, its appearance is simple: most of the time it looks like a multi-column table:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"512\" height=\"409\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/RDD-1.png\" alt=\"\" class=\"wp-image-844\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/RDD-1.png 512w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/RDD-1-300x240.png 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/figure>\n<\/div>\n\n\n<div class=\"wp-block-group has-background\" style=\"background-color:#e2f4ff\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p>An RDD is by definition:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>In-memory<\/strong> (stored in RAM)<\/li>\n\n\n\n<li><strong>Immutable<\/strong> (so that it can be processed concurrently)<\/li>\n\n\n\n<li><strong>Lazy<\/strong> (we will detail this behaviour later)<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<p><em>Note: an RDD has no column names<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">You mention Jobs and Stages, but what are they?<\/h4>\n\n\n\n<p>Once the query is submitted, the Spark Context breaks the code down into smaller execution units to make it easier to distribute.<\/p>\n\n\n\n<div class=\"wp-block-group\"><div 
class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\"><div class=\"wp-block-image\">\n<figure class=\"alignleft size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/encapsulation-1024x1024.png\" alt=\"\" class=\"wp-image-651\" width=\"327\" height=\"327\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/encapsulation.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/encapsulation-300x300.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/encapsulation-150x150.png 150w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/encapsulation-768x768.png 768w\" sizes=\"auto, (max-width: 327px) 100vw, 327px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"has-white-color has-text-color\"><\/p>\n\n\n\n<p class=\"has-white-color has-text-color\"><br><br><\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p class=\"has-text-align-left\">The <strong>Application<\/strong> is at the top of the hierarchy: it is the program you asked to run.<\/p>\n\n\n\n<p class=\"has-text-align-left\">The <strong>Job<\/strong> corresponds to the execution of an instruction (in practice, each action triggers a Job). 
<\/p>\n\n\n\n<p class=\"has-text-align-left\">The <strong>Stage<\/strong> is one part of the execution of an instruction (Job).<\/p>\n\n\n\n<p class=\"has-text-align-left\">The <strong>Task<\/strong> is the finest granularity in Spark: a unit of work executed locally by a single executor.<\/p>\n<\/div><\/div>\n\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n<\/div><\/div>\n\n\n\n<p class=\"has-white-color has-text-color\"><br><br><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Directed Acyclic Graph (DAG)<\/h4>\n\n\n\n<p>It is what Spark uses to split the <strong>Spark Job<\/strong> into execution units.<\/p>\n\n\n\n<div class=\"wp-block-group has-background\" style=\"background-color:#e2f4ff\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p><strong>[D]irected<\/strong>: every transition has a direction, so the sequence has a beginning and an end<\/p>\n\n\n\n<p><strong>[A]cyclic<\/strong>: it contains no cycles<\/p>\n\n\n\n<p><strong>[G]raph<\/strong>: a set of states and transitions<\/p>\n<\/div><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Back to our Job<\/strong><\/p>\n\n\n\n<p>There are two types of operations: <strong>transformations<\/strong> and <strong>actions<\/strong>.<\/p>\n\n\n\n<pre title=\"\" class=\"wp-block-code\"><code lang=\"java\" class=\"language-java line-numbers\">sc.textFile(\"\/home\/baptiste\/Workspace\/ventreDeParis.txt\")\n.flatMap(text =&gt; text.split(\" \"))     \/\/ Transformation\n.filter(word =&gt; word.length !=0)      \/\/ Transformation\n.map(word =&gt; word.replaceAll(\",\",\"\")) \/\/ Transformation\n.map(word =&gt; (word, 1))               \/\/ Transformation\n.reduceByKey( _ + _ )                 \/\/ Transformation\n.sortBy(t =&gt; t._2,false)              \/\/ Transformation\n.collect()                            \/\/ 
Action<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Transformations<\/h3>\n\n\n\n<p>This kind of operation is &#8220;lazy&#8221;. Transformations are first recorded in the DAG, then evaluated only when an <strong>action<\/strong> is executed.<\/p>\n\n\n\n<p>Here are a few examples of transformations:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>flatMap: returns several elements for a single input element of the RDD.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">rdd.flatMap(text =&gt; text.split(\" \"))<\/pre>\n\n\n\n<p id=\"177e\">I split the content of the chapter on spaces to get a list of words.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>filter: filters the data according to a condition<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">rdd.filter(word =&gt; word.length !=0)<\/pre>\n\n\n\n<p>I remove empty strings.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>map<strong>:<\/strong> applies a transformation function to each element.<\/li>\n<\/ul>\n\n\n\n<p id=\"7205\">In this RDD of words, I remove all commas.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">rdd.map(word =&gt; word.replaceAll(\",\",\"\"))<\/pre>\n\n\n\n<p id=\"2717\"><em>Note: The expression passed to map is an unnamed function, called a lambda or anonymous function.<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>reduceByKey: groups identical keys and applies a function to their values.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">rdd.reduceByKey( _ + _ ) <\/pre>\n\n\n\n<p><em>Note: This is Scala notation: the underscore lets you omit the declaration of the input variables<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>sortBy: sorts a collection by the result of a function; its sibling sortByKey sorts a key\/value collection by key (in both cases the order is 
ascending by default).<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">rdd.sortBy(t=&gt;t._2,false)<\/pre>\n\n\n\n<p id=\"de3c\">Here I sort by value in descending order.<\/p>\n\n\n\n<p>You will find the full list of transformations <a href=\"https:\/\/spark.apache.org\/docs\/latest\/rdd-programming-guide.html#transformations\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"1012\">Actions<\/h3>\n\n\n\n<p id=\"09d1\">An action is an operation that executes immediately and triggers the evaluation of the pending transformations.<\/p>\n\n\n\n<p><em>Note: While a transformation returns an RDD, actions return values native to the host language.<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Collect: returns all the elements contained in the RDD.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">rdd.collect() <\/pre>\n\n\n\n<p><strong>Other examples:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Count: counts the number of elements.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">val num = sc.parallelize(Seq(1, 2, 3, 4, 2))\nnum.count() \/\/ output: 5<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>First: returns the first element.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">num.first() \/\/ output: 1<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">DAG associated with the Job:<\/h4>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"588\" height=\"888\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/dag.png\" alt=\"\" class=\"wp-image-828\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/dag.png 588w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/dag-199x300.png 199w\" sizes=\"auto, (max-width: 588px) 100vw, 588px\" \/><figcaption 
class=\"wp-element-caption\">DAG shown in the Spark administration console<\/figcaption><\/figure>\n<\/div>\n\n\n<h4 class=\"wp-block-heading\">Task Scheduler<\/h4>\n\n\n\n<p>It is the intermediary between the <strong>Leader<\/strong> and the <strong>cluster manager<\/strong>. It sends the tasks generated from the DAG to the cluster manager; it is also in charge of managing their life cycle (replaying them on failure) and delaying them when needed.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1025\" height=\"637\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/spark_context_2-1.png\" alt=\"\" class=\"wp-image-774\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/spark_context_2-1.png 1025w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/spark_context_2-1-300x186.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/spark_context_2-1-768x477.png 768w\" sizes=\"auto, (max-width: 1025px) 100vw, 1025px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Cluster manager<\/h2>\n\n\n\n<p>The cluster manager is responsible for allocating and releasing the resources a Spark Job needs within the Spark cluster. It is a service fully decoupled from Spark.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/standalone.png\" alt=\"\" class=\"wp-image-751\" width=\"112\" height=\"101\"\/><\/figure>\n<\/div>\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n\n\n\n<p><strong>Standalone<\/strong>: for small clusters (fewer than 100 machines). 
For high availability, the master is paired with ZooKeeper.<\/p>\n\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/apache_Mesos.png\" alt=\"\" class=\"wp-image-750\" width=\"99\" height=\"152\"\/><\/figure>\n<\/div>\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n\n\n\n<p>Mesos: an excellent cluster manager, able to orchestrate large fleets of machines.<\/p>\n\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/yarn.png\" alt=\"\" class=\"wp-image-753\" width=\"144\" height=\"109\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/yarn.png 338w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/yarn-300x227.png 300w\" sizes=\"auto, (max-width: 144px) 100vw, 144px\" \/><\/figure>\n<\/div>\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n\n\n\n<p>YARN: Hadoop&#8217;s orchestrator is compatible with Spark.<\/p>\n\n\n\n<p class=\"has-white-color has-text-color\"><br><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/kubernetes.png\" alt=\"\" class=\"wp-image-752\" width=\"123\" height=\"144\"\/><\/figure>\n<\/div>\n\n\n<p class=\"has-white-color has-text-color\">.<\/p>\n\n\n\n<p>Kubernetes: many cloud providers offer this solution; it has the advantage of making the application more portable. 
You will, however, need to pay attention to I\/O.<\/p>\n\n\n\n<p class=\"has-white-color has-text-color\">.<\/p>\n\n\n\n<p><br><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Worker Node<\/h3>\n\n\n\n<p>The task finally reaches its destination! The worker is the one that queries your application&#8217;s storage layer. The task is processed in the worker&#8217;s RAM and produces a <strong>partition<\/strong> of an RDD. This memory is also used to cache RDDs (in some cases Spark can persist RDDs to disk). Once the task is processed, the result is sent back to the <strong>Leader<\/strong>. <\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"348\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/executors-1024x348.png\" alt=\"\" class=\"wp-image-811\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/executors-1024x348.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/executors-300x102.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/executors-768x261.png 768w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/executors-1536x522.png 1536w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/08\/executors.png 2047w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The risks of distributed computing<\/h2>\n\n\n\n<p>As mentioned above, I\/O deserves our full attention, especially when we write a Spark query.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Shuffling<\/h3>\n\n\n\n<p>One of the main risks of a Spark job is the <strong>Shuffle<\/strong>. 
Let us illustrate this term with an RDD that stores information about animals.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"779\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/groupByKey-1024x779.png\" alt=\"\" class=\"wp-image-642\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/groupByKey-1024x779.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/groupByKey-300x228.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/groupByKey-768x584.png 768w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/groupByKey-1536x1169.png 1536w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/groupByKey.png 2042w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>A shuffle is the need to <strong>redistribute<\/strong> data across partitions, which puts load on the network. It should be minimized as much as possible by reworking the Spark query. <\/p>\n\n\n\n<p>As the diagram shows, there are two kinds of operations:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Narrow: <\/strong>each partition of the new RDD depends on a single parent partition. This is the fastest and most predictable kind of operation. (Map, Filter, Union, &#8230;)  <\/li>\n\n\n\n<li><strong>Wide:<\/strong> the operation draws on several parent partitions and implies heavy data movement. 
It should be used with care (GroupByKey).<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"779\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/reduceByKey-1024x779.png\" alt=\"\" class=\"wp-image-641\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/reduceByKey-1024x779.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/reduceByKey-300x228.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/reduceByKey-768x584.png 768w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/06\/reduceByKey-1536x1169.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><em>Note: in an architecture where an executor shares a machine with a Cassandra node, data can be read locally, which avoids most of this network traffic!<\/em><\/p>\n\n\n\n<p>This query costs less network I\/O because the grouping is first performed within each partition. 
You will notice the use of the <strong>Map Reduce<\/strong> pattern.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Reductions:<\/h4>\n\n\n\n<p>Let us take the simplest signature of a reduction.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"java\" class=\"language-java\">def reduce(f: (T, T) =&gt; T): T<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"522\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/reduce-1.png\" alt=\"\" class=\"wp-image-987\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/reduce-1.png 1024w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/reduce-1-300x153.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/reduce-1-768x392.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Distributed computing requires the reduction function to be <strong>commutative<\/strong> and <strong>associative<\/strong>. In other words, whatever the order in which the elements are combined, the result is the same. The reduction mechanism also relies on the concepts of <strong>accumulation<\/strong> and <a rel=\"noreferrer noopener\" href=\"https:\/\/fr.wikipedia.org\/wiki\/R%C3%A9cursion_terminale\" target=\"_blank\">tail recursion<\/a>.<\/p>\n\n\n\n<p><em>Note: you can implement your own accumulator by extending the AccumulatorV2 abstract class<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Requests as Design<\/h2>\n\n\n\n<p>Also called <strong>Query First Design<\/strong>, the idea is to design a data model around a query you want to run (taking shuffling into account). 
This means multiplying the data models within your storage.  It is impossible to define one generic data model that handles every use case optimally. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to proceed?<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Design your data model around the query, think about where to place the partition keys to limit the shuffle, etc.<\/li>\n\n\n\n<li>Insert the data\n<ol class=\"wp-block-list\">\n<li>clean (remove duplicates, convert to lowercase, strip accents, punctuation, some useless words, \u2026)<\/li>\n\n\n\n<li>validate (check that all mandatory data is present, that it is consistent with your business domain, &#8230;)<\/li>\n\n\n\n<li>transform (to conform to your data model)<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li>Write the query for your report (normally the easiest part, since most of the work was done earlier)<\/li>\n<\/ol>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Shall we get started?<\/h2>\n\n\n\n<p>Spark has two modes: <strong>Cluster <\/strong>mode, which we have just seen, and <strong>Local<\/strong> mode, which we will use here.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Installation<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">[baptiste@fedora ~] curl -s \"https:\/\/get.sdkman.io\" | bash\n[baptiste@fedora ~] sdk install java 11.0.11.hs-adpt\n[baptiste@fedora ~] sdk install scala 2.13.5\n[baptiste@fedora ~] sdk install sbt\n[baptiste@fedora ~] sdk install spark\n[baptiste@fedora ~] spark-shell\n Using Spark's default log4j profile: org\/apache\/spark\/log4j-defaults.properties\n Setting default log level to \"WARN\".\n To adjust logging level use 
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n Spark context Web UI available at http:\/\/fedora:4040\n Spark context available as 'sc' (master = local[*], app id = local-1623176900967).\n Spark session available as 'spark'.\n Welcome to\n      ____              __\n    \/  __\/_   ___  ____\/ \/__\n   _\\  \\\/ _ \\\/ _ `\/ __\/  '_\/\n  \/____\/ .__\/\\_,_\/_\/ \/_\/\\_\\    version 3.1.1\n      \/_\/\n\n Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.11)\n Type in expressions to have them evaluated.\n Type :help for more information.\n scala&gt;<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">[baptiste@fedora current]$ tree .\n.\n \u251c\u2500\u2500 bin\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 beeline\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 docker-image-tool.sh\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 find-spark-home\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 load-spark-env.sh\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 pyspark\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 run-example\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 spark-class\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 sparkR\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 spark-shell\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 spark-sql\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 spark-submit\n \u251c\u2500\u2500 conf\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 fairscheduler.xml.template\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 log4j.properties.template\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 metrics.properties.template\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 spark-defaults.conf.template\n \u2502&nbsp;&nbsp; \u251c\u2500\u2500 spark-env.sh.template\n \u2502&nbsp;&nbsp; \u2514\u2500\u2500 workers.template<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p>A job can be submitted in several ways: interactively with spark-shell (or pyspark for Python), through a class with spark-class, or as a jar with spark-submit.<\/p>\n\n\n\n<p><em>Note: 
the spark-env.sh script located in the conf directory is used to set the IP addresses and ports of the follower nodes in Spark cluster mode.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Scala<\/h2>\n\n\n\n<p>Here is the layout of an <a href=\"https:\/\/www.scala-sbt.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">sbt<\/a> project:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">\u251c\u2500\u2500 build.sbt\n\u251c\u2500\u2500 project\n\u2502&nbsp;&nbsp; \u2514\u2500\u2500 target\n\u251c\u2500\u2500 src\n\u2502&nbsp;&nbsp; \u251c\u2500\u2500 main\n\u2502&nbsp;&nbsp; \u2502&nbsp;&nbsp; \u2514\u2500\u2500 scala\n\u2502&nbsp;&nbsp; \u2502&nbsp;&nbsp;     \u2514\u2500\u2500 MeanLength.scala\n\u2502&nbsp;&nbsp; \u2514\u2500\u2500 test\n\u2502&nbsp;&nbsp;     \u2514\u2500\u2500 scala\n\u2514\u2500\u2500 target<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<pre title=\"build.sbt\" class=\"wp-block-code\"><code lang=\"java\" class=\"language-java\">name := \"spark-demo\"\n\nversion := \"0.1\"\n\nscalaVersion := \"2.12.14\"\n\n\/\/ Matches the Spark version reported by spark-shell above\nval sparkVersion = \"3.1.1\"\n\n\/\/ \"provided\": spark-submit already puts the Spark jars on the classpath\nlibraryDependencies ++= Seq(\n  \"org.apache.spark\" %% \"spark-core\" % sparkVersion % \"provided\"\n  , \"org.apache.spark\" %% \"spark-sql\"   % sparkVersion % \"provided\"\n)\n\nidePackagePrefix := Some(\"com.bmeynier.article\")<\/code><\/pre>\n\n\n\n<p><em>Note: sbt is the build and dependency-management tool for Scala. 
It is also possible to use Maven.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">RDD<\/h2>\n\n\n\n<pre class=\"wp-block-code\" style=\"font-size:14px\"><code lang=\"java\" class=\"language-java line-numbers\"> import org.apache.spark._\n import org.apache.spark.SparkContext._\n import org.apache.spark.rdd.RDD\n object MeanLength {\n         def main(args: Array[String]): Unit = {\n                 val conf = new SparkConf().setAppName(\"mean-length\")\n                 val sc = new SparkContext(conf)\n                 val entry = sc.textFile(\"\/home\/baptiste\/Workspace\/ventreDeParis.txt\")\n\n                 \/\/ Split every line on spaces and drop the empty tokens left by repeated spaces\n                 val words = entry.flatMap(l =&gt; l.split(\" \")).filter(_.nonEmpty)\n                 \/\/ A single key per RDD, so reduceByKey sums everything into one pair\n                 val rdd_count_words = words.map(m =&gt; (\"WORDS\", 1)).reduceByKey{case (x,y) =&gt; x + y}\n                 val rdd_count_letters = words.map(m =&gt; (\"LETTERS\", m.length())).reduceByKey( _ + _ )\n\n                 \/\/ collect() is the action that triggers the computation\n                 val letters_count = rdd_count_letters.collect()(0)._2\n                 val words_count    = rdd_count_words.collect()(0)._2\n                 val means = letters_count \/ words_count.toFloat\n                 println(\"The words mean is:\"+means)\n                 sc.stop()\n       }\n}<\/code><\/pre>\n\n\n\n<p><strong>Build a jar:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">sbt clean package<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Submit the job to Spark as a jar:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">spark-submit target\/scala-2.12\/spark-demo_2.12-0.1.jar<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Dataframe<\/h2>\n\n\n\n<p>The DataFrame is a high-level API of the Spark SQL module; in a way, it is an RDD on steroids! 
The idea is to process data the way a SQL query would.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"java\" class=\"language-java\">import org.apache.spark._\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.functions.{split,explode,length,mean}\nobject Code {\n    def main(args: Array[String]): Unit = {\n         val spark = SparkSession.builder().getOrCreate()\n         import spark.implicits._\n         val data = spark.read.format(\"text\").load(\"\/home\/baptiste\/Workspace\/ventreDeParis.txt\")\n\n         \/\/ One row per line, with the line split into an array of words\n         val wordsTab = data.select(split(data(\"value\"),\" \").alias(\"column1\"))\n         \/\/ explode turns each array element into its own row\n         val words = wordsTab.select(explode(wordsTab(\"column1\")).alias(\"word\"))\n         \/\/ Number of occurrences of each word\n         val res = words.groupBy(\"word\").count\n         val table_with_length = words.withColumn(\"length\",length(words(\"word\")))\n         table_with_length.show()\n         \/\/ Mean word length over the whole text\n         val moy = table_with_length.select(mean(table_with_length(\"length\")).alias(\"mean\"))\n         moy.show(false)\n         res.show\n   }\n}<\/code><\/pre>\n\n\n\n<p><em>Note: it is entirely possible to write your application in Python with PySpark, but expect slightly lower performance than with a language that runs natively on the JVM.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The administration console<\/h2>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"993\" height=\"1005\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_main.png\" alt=\"\" class=\"wp-image-1004\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_main.png 993w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_main-296x300.png 296w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_main-768x777.png 768w\" sizes=\"auto, 
(max-width: 993px) 100vw, 993px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1011\" height=\"930\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_stage.png\" alt=\"\" class=\"wp-image-1006\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_stage.png 1011w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_stage-300x276.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_stage-768x706.png 768w\" sizes=\"auto, (max-width: 1011px) 100vw, 1011px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1005\" height=\"967\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_dag.png\" alt=\"\" class=\"wp-image-1008\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_dag.png 1005w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_dag-300x289.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_dag-768x739.png 768w\" sizes=\"auto, (max-width: 1005px) 100vw, 1005px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1013\" height=\"897\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_details.png\" alt=\"\" class=\"wp-image-1007\" srcset=\"https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_details.png 1013w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_details-300x266.png 300w, https:\/\/baptiste-meynier.com\/wp-content\/uploads\/2021\/09\/console_job_details-768x680.png 768w\" sizes=\"auto, (max-width: 1013px) 100vw, 1013px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Source code<\/h2>\n\n\n<div 
class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/gitlab.com\/bmeynier\/Spark\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/baptiste-meynier.com\/wp-content\/uploads\/2020\/10\/gitlab.png\" alt=\"\" class=\"wp-image-93\" width=\"117\" height=\"108\"\/><\/a><\/figure>\n<\/div>\n\n\n<p class=\"has-text-align-center\"><em>Source code<\/em><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Twitter accounts to follow<\/h2>\n\n\n\n<p><a href=\"https:\/\/twitter.com\/databricks\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/twitter.com\/databricks<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/twitter.com\/matei_zaharia?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor\">https:\/\/twitter.com\/matei_zaharia?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/twitter.com\/jeffdean?lang=fr\">https:\/\/twitter.com\/jeffdean?lang=fr<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/twitter.com\/michaelarmbrust?lang=fr\">https:\/\/twitter.com\/michaelarmbrust?lang=fr<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p><a href=\"https:\/\/github.com\/databricks\/Spark-The-Definitive-Guide\">https:\/\/github.com\/databricks\/Spark-The-Definitive-Guide<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/www.usenix.org\/legacy\/publications\/library\/proceedings\/usenix01\/sugerman\/sugerman_html\/node7.html\">https:\/\/www.usenix.org\/legacy\/publications\/library\/proceedings\/usenix01\/sugerman\/sugerman_html\/node7.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/livebook.manning.com\/book\/mastering-large-datasets\/chapter-5\/9\">https:\/\/livebook.manning.com\/book\/mastering-large-datasets\/chapter-5\/9<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/dzone.com\/articles\/what-is-data-validation\">https:\/\/dzone.com\/articles\/what-is-data-validation<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/b3d.bdpedia.fr\/spark-batch.html\" target=\"_blank\" 
rel=\"noreferrer noopener\">http:\/\/b3d.bdpedia.fr\/spark-batch.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/docs\/latest\/hardware-provisioning.html\">https:\/\/spark.apache.org\/docs\/latest\/hardware-provisioning.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html\">https:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/\">https:\/\/spark.apache.org\/<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/index.html\">https:\/\/spark.apache.org\/docs\/latest\/api\/scala\/org\/apache\/spark\/index.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/spark-architecture\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/techvidvan.com\/tutorials\/spark-architecture\/<\/a><\/p>\n\n\n\n<p><a rel=\"noreferrer noopener\" href=\"https:\/\/sparkbyexamples.com\/spark\/spark-shuffle-partitions\/\" target=\"_blank\">https:\/\/sparkbyexamples.com\/spark\/spark-shuffle-partitions\/<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/medium.com\/expedia-group-tech\/start-your-journey-with-apache-spark-part-1-3575b20ee088\">https:\/\/medium.com\/expedia-group-tech\/start-your-journey-with-apache-spark-part-1-3575b20ee088<\/a><\/p>\n\n\n\n<p><a rel=\"noreferrer noopener\" href=\"https:\/\/meritis.fr\/spark-shuffle\/\" target=\"_blank\">https:\/\/meritis.fr\/spark-shuffle\/<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/univalence.io\/blog\/articles\/shuffle-dans-spark-reducebykey-vs-groupbykey\/\">https:\/\/univalence.io\/blog\/articles\/shuffle-dans-spark-reducebykey-vs-groupbykey\/<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/blog.engineering.publicissapient.fr\/2017\/09\/19\/tester-du-code-spark-2-la-pratique\/\">https:\/\/blog.engineering.publicissapient.fr\/2017\/09\/19\/tester-du-code-spark-2-la-pratique\/<\/a><\/p>\n\n\n\n<p><a 
href=\"https:\/\/cassandra.apache.org\/\">https:\/\/cassandra.apache.org\/<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/openclassrooms.com\/fr\/courses\/4462426-maitrisez-les-bases-de-donnees-nosql\/4462471-maitrisez-le-theoreme-de-cap\">https:\/\/openclassrooms.com\/fr\/courses\/4462426-maitrisez-les-bases-de-donnees-nosql\/4462471-maitrisez-le-theoreme-de-cap<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/events.static.linuxfound.org\/sites\/events\/files\/slides\/Spark-Cassandra%20Integration,%20theory%20and%20practice%20ApacheBigData%202015_0.pdf\">https:\/\/events.static.linuxfound.org\/sites\/events\/files\/slides\/Spark-Cassandra%20Integration,%20theory%20and%20practice%20ApacheBigData%202015_0.pdf<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/itnext.io\/spark-cassandra-all-you-need-to-know-tips-and-optimizations-d3810cc0bd4e\">https:\/\/itnext.io\/spark-cassandra-all-you-need-to-know-tips-and-optimizations-d3810cc0bd4e<\/a><\/p>\n\n\n\n<p><a rel=\"noreferrer noopener\" href=\"https:\/\/opencredo.com\/blogs\/deploy-spark-apache-cassandra\/\" target=\"_blank\">https:\/\/opencredo.com\/blogs\/deploy-spark-apache-cassandra\/<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/www.inpact-hardware.com\/article\/1751\/lextraordinaire-evolution-stockage\">https:\/\/www.inpact-hardware.com\/article\/1751\/lextraordinaire-evolution-stockage<\/a><\/p>\n\n\n\n<p><a rel=\"noreferrer noopener\" href=\"https:\/\/www.securedatarecovery.com\/blog\/hdd-storage-evolution\" target=\"_blank\">https:\/\/www.securedatarecovery.com\/blog\/hdd-storage-evolution<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/www.ibm.com\/ibm\/history\/exhibits\/storage\/storage_3330.html\">https:\/\/www.ibm.com\/ibm\/history\/exhibits\/storage\/storage_3330.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/blog.acelaboratory.com\/reading_speed.html\">https:\/\/blog.acelaboratory.com\/reading_speed.html<\/a><\/p>\n\n\n\n<p><a 
href=\"https:\/\/www.atlantic.net\/vps-hosting\/performance-matters-how-disk-throughput-and-iops-impact-your-business\/\">https:\/\/www.atlantic.net\/vps-hosting\/performance-matters-how-disk-throughput-and-iops-impact-your-business\/<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/cedric.cnam.fr\/vertigo\/Cours\/RCP216\/tpGraphes.html\">https:\/\/cedric.cnam.fr\/vertigo\/Cours\/RCP216\/tpGraphes.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/docs\/latest\/running-on-kubernetes.html\">https:\/\/spark.apache.org\/docs\/latest\/running-on-kubernetes.html<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/fr.wikipedia.org\/wiki\/Arbre_de_Merkle\">https:\/\/fr.wikipedia.org\/wiki\/Arbre_de_Merkle<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A bit of history Infrastructure Big data = big disk? HDD: a casing containing a series of platters coated with a ferromagnetic layer. The direction of magnetization represents the individual binary elements. 
Data is read and written by a head (much as a turntable does with a vinyl record) that &hellip;<\/p>\n","protected":false},"author":1,"featured_media":629,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[8],"tags":[],"class_list":["post-287","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata"],"_links":{"self":[{"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/posts\/287","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/comments?post=287"}],"version-history":[{"count":402,"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/posts\/287\/revisions"}],"predecessor-version":[{"id":1856,"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/posts\/287\/revisions\/1856"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/media\/629"}],"wp:attachment":[{"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/media?parent=287"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/categories?post=287"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/baptiste-meynier.com\/index.php\/wp-json\/wp\/v2\/tags?post=287"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}