{"id":2915,"date":"2012-09-02T06:59:00","date_gmt":"2012-09-02T03:59:00","guid":{"rendered":"http:\/\/daimon.me\/blog\/?p=2915"},"modified":"2017-01-02T10:19:41","modified_gmt":"2017-01-02T08:19:41","slug":"dupa-5-luni-zelist","status":"publish","type":"post","link":"http:\/\/daimon.me\/blog\/2012\/09\/dupa-5-luni-zelist\/","title":{"rendered":"After 5 months, Zelist &#8230;"},"content":{"rendered":"<div class=\"entry\">\n<p style=\"text-align: justify;\">Sometime in April of this year I extracted Zelist's <a href=\"https:\/\/daimon.me\/blog\/2012\/04\/ce-monitorizeaza-de-fapt-zelist\/\" target=\"_blank\">list of blogs<\/a>; perhaps you remember. Well, almost 5 months have passed since then, and I am curious about a few things regarding their ranking. For example, how many blogs have they added since then? And which blogs, exactly? Before we can answer that question, we need a copy of Zelist's current list. Since last time I did not explain clearly how to get one, I think now is the moment to do it.<\/p>\n<p style=\"text-align: center;\">~*~<\/p>\n<p>To begin with, you need (as a <em>prerequisite<\/em>) either a Linux distribution or <a href=\"http:\/\/www.cygwin.com\/\" target=\"_blank\">Cygwin<\/a> for Windows. 
If you use Cygwin, you have to select Curl manually (in the Net category), since it is not part of the default install set; also, a complete installation of Cygwin is not recommended at all, as it can take several hours and would leave you 12 GB poorer.<\/p>\n<p>Right, we start the terminal, pick a working directory, create a script there (<span style=\"color: #800000;\">stiinta.sh<\/span> is a nice name), and in the script we write:<\/p>\n<blockquote><p><code>for ((i=1; i&lt;=<span style=\"color: #800000;\">1297<\/span>; i++))<br \/>\ndo<br \/>\n<span style=\"color: #800000;\"># echo \"Pagina $i\" &gt;&gt; f1.txt;<\/span> # this line is not strictly necessary<br \/>\ncurl \"http:\/\/www.zelist.ro\/bloguri\/pagina-$i.html\" | grep \"infoBtn\" &gt;&gt; f1.txt;<br \/>\ndone<\/code><\/p><\/blockquote>\n<p style=\"text-align: justify;\"><del>Note that the quotation marks WordPress inserts are wrong; you have to type plain quotation marks by hand. Oh, and it is best to edit the script in Notepad++, where at the end you choose Edit &gt; EOL Conversion &gt; Unix format from the menu; if you edit with Windows Notepad, it will insert some characters that Cygwin does not digest well. Ah, and there must be no space between the parentheses of the for; any such space in this post comes from WordPress plugins ((one of the downsides of technology, lol)).<\/del><\/p>\n<p style=\"text-align: justify;\">(edit: Four years later, I discovered the tag called code..)<\/p>\n<p style=\"text-align: justify;\">Copy the text into an editor that can save the file in Unix (Linux) format; plain Notepad cannot, but Notepad++ can. 
Also, the double parentheses of the for loop are written with no space between them.<\/p>\n<p style=\"text-align: justify;\">What the script does is fairly easy to follow if you are comfortable with variables (from math class, say): we have a loop (<span style=\"color: #800000;\">for<\/span>) that runs 1297 steps (the variable i), and at each step it appends to the file f1.txt the optional page marker and the matching lines from that page. <span style=\"color: #800000;\">Curl<\/span> fetches the entire content of the web page; the vertical bar &#8220;<span style=\"color: #800000;\">|<\/span>&#8221; is called a pipe and passes the output of the first command on to the next; <span style=\"color: #800000;\">grep<\/span> keeps only the lines that contain a given text (<span style=\"color: #800000;\">infoBtn<\/span>), and if you look at the HTML source of a Zelist page you will see why ((edit: The original article used onClick. In the meantime Treeworks introduced variations in how they link to the listed blogs and no longer use JavaScript for every item in the list. As of now the infoBtn attribute appears to be present for all blogs in the list, though it may well change in the future)). The &#8220;&gt;&gt;&#8221; operator names a file to which the results of the whole operation are written, instead of having them dumped on the screen ((In fact you have several options: the &#8220;&gt;&#8221; operator also writes to a file, but if the file already exists it is overwritten from scratch at every step. 
The &#8220;&gt;&gt;&#8221; operator only appends if the file already exists, which is what we need in this particular case.)).<\/p>\n<p style=\"text-align: justify;\">This script is also the most time-consuming part, by the way: depending on the processor and the speed of the internet connection, it can take several<strong> hours<\/strong> ((When the article was written, the Zelist &#8220;ranking&#8221; listed around 60,000 blogs. Today, in March 2013, it shows only the top 5000, which also cuts the running time to just a few tens of minutes)). At the end we will have a file that looks like this:<\/p>\n<blockquote>\n<p style=\"text-align: left;\">Pagina 1<br \/>\n&lt;a href=&quot;http:\/\/tudorchirila.blogspot.com&quot; onClick=&quot;window.open(&#39;http:\/\/tudorchirila.blogspot.com&#39;);return false;&quot;&gt;<br \/>\n&lt;a href=&quot;http:\/\/www.umbrelaverde.ro&quot; onClick=&quot;window.open(&#39;http:\/\/www.umbrelaverde.ro&#39;);return false;&quot;&gt;<br \/>\n&lt;a href=&quot;http:\/\/www.ciutacu.ro&quot; onClick=&quot;window.open(&#39;http:\/\/www.ciutacu.ro&#39;);return false;&quot;&gt;<\/p>\n<\/blockquote>\n<p style=\"text-align: justify;\">Open <span style=\"color: #800000;\">f1.txt<\/span> in an editor that can also number the lines (I recommend <a href=\"http:\/\/notepad-plus-plus.org\/\" target=\"_blank\">Notepad++<\/a>). Make sure the number of blogs on the Zelist site equals the number of lines in the file (<em>minus 50, if you kept the echo command in the script<\/em>). 
If the numbers do not match, you will have to rack your brain to find where the mistake is (though in principle the results should be correct).<\/p>\n<p style=\"text-align: justify;\">To remove the lines that mark the page, we use an inverted grep (which shows the lines that do<strong> not<\/strong> contain a given text):<\/p>\n<blockquote><p>grep -v &quot;Pagina&quot; f1.txt &gt; f2.txt<\/p><\/blockquote>\n<p style=\"text-align: justify;\">Then we edit f2.txt and, with a simple search &amp; replace, remove the text in front of the web addresses (it starts with tab characters, which is where the blank space comes from):<\/p>\n<blockquote><p>\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 &lt;a href=&quot;<\/p><\/blockquote>\n<p style=\"text-align: justify;\">There, now it looks a bit more human:<\/p>\n<blockquote><p>http:\/\/tudorchirila.blogspot.com&quot; onClick=&quot;window.open(&#39;http:\/\/tudorchirila.blogspot.com&#39;);return false;&quot;&gt;<br \/>\nhttp:\/\/www.umbrelaverde.ro&quot; onClick=&quot;window.open(&#39;http:\/\/www.umbrelaverde.ro&#39;);return false;&quot;&gt;<br \/>\nhttp:\/\/www.ciutacu.ro&quot; onClick=&quot;window.open(&#39;http:\/\/www.ciutacu.ro&#39;);return false;&quot;&gt;<br \/>\nhttp:\/\/www.bookblog.ro&quot; onClick=&quot;window.open(&#39;http:\/\/www.bookblog.ro&#39;);return false;&quot;&gt;<\/p><\/blockquote>\n<p>All that remains is to get rid of the leftover HTML after the domain name. 
This is easy: we use the <span style=\"color: #800000;\">cut<\/span> command and give it the double-quote character as the delimiter:<\/p>\n<blockquote><p>cat f2.txt | cut -f 1 -d &#39;&quot;&#39; &gt;&gt; f3.txt<br \/>\n<span style=\"color: #800000;\"># -d apostrophe, double quote, apostrophe<br \/>\n# cut splits the text into pieces separated by a given character; everything before the first <em>token<\/em> (the double quote) is kept (-f 1), everything after it is discarded<br \/>\n<\/span><\/p><\/blockquote>\n<p style=\"text-align: justify;\">It could be done more simply by passing f2.txt directly to <span style=\"color: #800000;\">cut<\/span> as an argument, but I wanted to use a pipe once more. Finding out what the <span style=\"color: #800000;\">cat<\/span> command does is left as an exercise for the reader. There, now we really do have a completely clean f3 that contains only addresses!<\/p>\n<blockquote><p>tudorchirila.blogspot.com<br \/>\numbrelaverde.ro<br \/>\nciutacu.ro<br \/>\nbookblog.ro<\/p><\/blockquote>\n<p style=\"text-align: justify;\">As it happens, though, Zelist stores some addresses with the &#8220;www.&#8221; subdomain in front. What do we do? The quick solution is a plain search &amp; replace in the editor, but the drawback is that a domain named &#8220;awww.ro&#8221; would become &#8220;aro&#8221;, and thus unusable. I chose a more hands-on approach, namely:<\/p>\n<blockquote><p>grep &quot;www.&quot; f3.txt &gt; temp<br \/>\nsort temp &gt; temp2<\/p><\/blockquote>\n<p style=\"text-align: justify;\">Note that grep matches &#8220;www&#8221; followed by any character here, not only by a dot, because it reads the dot in the pattern as a regular-expression wildcard (<code>grep -F<\/code> would treat it literally); keep that in mind. The file temp2 holds all the lines that the operation will affect, sorted alphabetically. 
The vast majority really do belong there, but the first 5 do not:<\/p>\n<blockquote><p>angiwww.blogspot.com<br \/>\nawww.ro<br \/>\nqqwwwrrrrttttt.blogspot.com<br \/>\nrawwww.blogspot.com<br \/>\nwww,fansf.wordpress.com<\/p><\/blockquote>\n<p style=\"text-align: justify;\">So I run the search &amp; replace in the text editor, then go back and undo the change where needed. That is how I finally arrived at f4.txt, a file that can be compared with the one from April. It did not even take long.<\/p>\n<p><em>(to be continued in a future article)<\/em><br \/>\n<em>(I have uploaded to the server <a href=\"http:\/\/daimon.me\/storage\/zr310812.rar\" target=\"_blank\">an archive<\/a> that contains the file f4.txt, to which I will add relevant files as I write more articles)<\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Sometime in April of this year I extracted Zelist's list of blogs; perhaps you remember. Well, almost 5 months have passed since then, and I am curious about a few things regarding their ranking. For example, how many blogs have they added since then? And which blogs, exactly? 
Before we can &#8230;<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40],"tags":[],"class_list":{"0":"post-2915","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-internet-si-tehnica","7":"anons"},"_links":{"self":[{"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/posts\/2915","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/comments?post=2915"}],"version-history":[{"count":4,"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/posts\/2915\/revisions"}],"predecessor-version":[{"id":4487,"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/posts\/2915\/revisions\/4487"}],"wp:attachment":[{"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/media?parent=2915"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/categories?post=2915"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/daimon.me\/blog\/wp-json\/wp\/v2\/tags?post=2915"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}