I have an idle DigitalOcean VPS that only runs an RSS service and an overseas proxy for NetEase Cloud Music, so most of its compute sits wasted. Here I set this Debian VPS up as a docker-machine host to play with containers.

```
(Mac)$ docker version
```
First, make sure sudo on the VPS does not require a password:

```
# visudo
```
```
$ docker-machine -D create \
```

`-D` turns on debug mode. The driver can be Google Compute Engine, Amazon, Azure, or DigitalOcean's official Docker service; here we use `generic`, which works with any VPS.
The default key here is `~/.ssh/id_rsa`; if yours is different, pass `--generic-ssh-key <path to private key>`. If you want to configure …
If everything goes well, you will see:

```
$ docker-machine ls
```
The machine currently in use is the VirtualBox `default` on the Mac; use `eval $(docker-machine env asgard)` to switch to the VPS.

Finally, if you want to sign up for DigitalOcean, feel free to use my referral link.
Here is a need of mine: when learning a new language, I often want to see the best practices around some library or function. The official docs only tell you how to call it, not when to call it, what the surrounding context is, or how to organize the code. And when I search GitHub for snippets using the function, the results often run a dozen-plus pages, full of duplicates, and most of the code is not very good. So my requirements are:

Sourcegraph's self-hosted container fits well, supporting Golang, Java, Python, JavaScript, and Ruby (far better and far more convenient than OpenGrok). The one catch: requirement 2 above requires payment (damn).
The command demonstrated on sourcegraph's site is:

```
docker run \
```

But this cannot run directly on a Mac; apparently volumes under $HOME do not work at all (see the issue).
So tweak it slightly:

```
docker run \
  --publish 7080:7080 --rm \
  --volume /data/sourcegraph/config:/etc/sourcegraph \
  --volume /data/sourcegraph/data:/var/opt/sourcegraph \
  sourcegraph/server:2.5.17
```
Then log in at http://… Note that the default `maxReposToSearch` is too small; I changed it to 200.

Next, there are many ways to add projects to search. You can enter a GitHub token to index your own repositories (what's worth searching in my own junk code?), or add third-party code as in the screenshot. Here we choose "Add other repository"; "repos.list" adds a blank url and path for you to fill in, then prompts you to restart the server, and the server will clone and index the code by itself.
The list I picked is rsc/corpus, Russ Cox's collection of Golang projects. Code that passes muster with Russ is guaranteed to be high quality.

Looking closer, Russ built a bot that adds the selected GitHub projects to the corpus, so the commit messages are very regular:

We only need to extract the project addresses from the commit messages and convert them into the url and path that the sourcegraph config expects.

Here I use the GitHub API to get the commit info, and jq (a handy command-line JSON tool) to filter out the messages.
```
curl https://api.github.com/repos/rsc/corpus/commits\?per_page\=130 | jq '.[] | {message: .commit.message}' | grep addproject | grep -o 'github.com/\w\+/\w\+' > repos.txt
```

Biu, and there is the project list (86 projects as of this writing):
Then convert the list into config entries with a few lines of Golang:

```
func main() {
```

Compile it as `converter`, run `cat repos.txt | ./converter | pbcopy`, and paste into the sourcegraph config (don't forget to delete the extra trailing comma at the end of the config JSON).
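The Go source is truncated in the excerpt above; for illustration, an equivalent converter sketched in Python (the JSON field names `url` and `path` here are an assumption following the repos.list convention described earlier — check them against your sourcegraph version):

```python
def to_repo_entry(repo):
    # "github.com/user/repo" -> one repos.list-style entry.
    # NOTE: the exact field names ("url", "path") are assumed, not taken
    # from the original Go converter.
    path = repo.split('/', 1)[1]  # drop the leading "github.com/"
    return '{{"url": "https://{0}", "path": "{1}"}},'.format(repo, path)

for repo in ["github.com/rsc/corpus", "github.com/golang/go"]:
    print(to_repo_entry(repo))
```

Piping `repos.txt` through a loop like this reproduces the `cat repos.txt | ./converter | pbcopy` workflow.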
Done! The projects take a while to clone, and then show up under explore:

Now it can also be used to search for the code snippets you want.

A small exercise: use the GitHub API to add your own starred projects.
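A sketch of that exercise: the real GitHub endpoint is `/users/<name>/starred`, returning JSON arrays of repository objects with a `full_name` field. Here the parsing is demonstrated against inline sample data so the snippet runs offline; fetching the pages over HTTP is left out:

```python
import json

def starred_repo_names(pages_json):
    # Extract "user/repo" names from pages of GitHub /users/<name>/starred
    # responses: each page is a JSON array of repo objects with "full_name".
    names = []
    for page in pages_json:
        for repo in json.loads(page):
            names.append(repo["full_name"])
    return names

# Offline sample standing in for a real API response page:
sample_page = '[{"full_name": "rsc/corpus"}, {"full_name": "pallets/click"}]'
print(starred_repo_names([sample_page]))  # ['rsc/corpus', 'pallets/click']
```

Each name can then be fed through the same list-to-config conversion as the corpus repos.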
PS: hammering away like this will git clone a big pile of code and quickly use up the docker-machine's space, so remember to enlarge it. /data lives in a tmpfs area, i.e. it is loaded into memory (see here), so you need to raise `--virtualbox-memory` (default 1024 MB). `--virtualbox-disk-size` is where local images are stored and is also worth enlarging.

```
docker@default:~$ df -h   # after docker-machine ssh
```
This post hopes to help readers play happily with Kubernetes on a Mac painlessly: no injections, no pills, no minikube installation.

Translated and adapted from Romin Irani's https://rominirani.com/tutorial-getting-started-with-kubernetes-with-docker-on-mac-7f58467203fd

In the Docker Edge preferences, "Enable Kubernetes" -> "Apply" -> "Install" will install the required libraries and run a default Kubernetes cluster in the background.

Once you see the two green dots, "Docker is running" and "Kubernetes is running", everything is ready.
You can check the installation from the command line. The Server and Client versions may differ, and if you installed the Kubernetes from the gcloud SDK, `current-context` may differ too (depending on where your server is).

```
$ kubectl version
```
Now look at the cluster; it currently contains only one node.

```
$ kubectl cluster-info
```

Next, install a Dashboard for the cluster we just set up; the installation is exactly the process of Kubernetes creating deployments/services. The Dashboard's kubernetes-dashboard.yaml file is also a very good learning example, worth reading.

```
$ kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
secret "kubernetes-dashboard-certs" created
serviceaccount "kubernetes-dashboard" created
role "kubernetes-dashboard-minimal" created
rolebinding "kubernetes-dashboard-minimal" created
deployment "kubernetes-dashboard" created
service "kubernetes-dashboard" created
```
In kubernetes-dashboard.yaml, note that the namespace under metadata is kube-system. Let's see which pods live under kube-system.

```
$ kubectl get pods --namespace=kube-system
```

They correspond one-to-one to the Kubernetes architecture diagram:

Each pod starts in the ContainerCreating state; wait a few seconds and they all turn Running.
Once everything is running, we start a proxy server to access the Kubernetes API server locally:

```
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
```

Visit http://127.0.0.1:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy in a browser:

Choose SKIP and you will see the Dashboard:

Click Nodes in the sidebar to see the "docker-for-desktop" node shown earlier.
Note: the original article used `kubectl port-forward` with port 8443, so the address bar in its screenshots reads localhost:8443. The command used there was:

```
$ kubectl port-forward kubernetes-dashboard-845747bdd4-9fm69 8443:8443 --namespace=kube-system
Forwarding from 127.0.0.1:8443 -> 8443
```

But from what I can find, the `kubectl proxy` approach seems cleaner and more mainstream.
Let's run an Nginx container to see the whole flow:

```
$ kubectl run hello-nginx --image=nginx --port=80
deployment "hello-nginx" created
```

This command creates a deployment, the deployment creates a pod, and the pod runs the container:

```
$ kubectl get pods
```
Wait a few seconds and look at Deployments in the Dashboard:

When Pods goes from 0/1 to 1/1, the earlier command's output shows Running as well:

```
$ kubectl get pods
```
Click Pod in the sidebar, then the pod we just created, "hello-nginx-556b7bf96-2xw8f", to see its details:

Here you can see the default labels assigned and the allocated IP (from the docker-for-desktop node).

The buttons in the upper-right corner are features we use all the time: EXEC opens an in-browser shell into the pod; LOGS shows the logs.
Likewise, we can get the same information with `kubectl describe node/pod`:

```
$ kubectl describe pod hello-nginx-556b7bf96-2xw8f
```
As mentioned earlier, we can port-forward the pod for external access:

```
$ kubectl get pods
```

But this only exposes a particular pod's port, which is not very practical.
So let's try another way: expose the deployment as a service for external access:

```
$ kubectl get deployment
```

Now localhost:30351 is reachable from the browser.

Notice that the actual EXTERNAL-IP is none: the Docker cluster installed on a Mac has no LoadBalancer; only cloud services provide one.
An old article: Peter Norvig's Lisp interpreter written in Python. Both the code and the explanations are perfect. I remember companies setting similar interview questions; with the approach in this article you would crush them.

A fairly detailed introduction to DNS.

A collection of architecture and big-data articles; the quality is uneven, but there is occasionally something fun.

Debugging write-ups are my favorite kind of article, especially this one: "in a multithreaded, highly concurrent environment, how do you debug a bug that appears on average once every million runs?". It walks through the thinking and the tools together and is well worth reading.

A penetration-testing cheatsheet; friends new to security may find it useful.

Rob Pike on the programming advice Ken Thompson gave him:
Ken taught me that thinking before debugging is extremely important. If you dive into the bug, you tend to fix the local issue in the code, but if you think about the bug first, how the bug came to be, you often find and correct a higher-level problem in the code that will improve the design and prevent further bugs.
From Two Scoops of Django; reads somewhat like best practices for writing apps.

A brief introduction to what Continuous Delivery and DevOps are.

A high-quality Zhihu column focusing on recommendation-algorithm articles.

Another good debugging-style read: an image-sharing site on the performance gains and pitfalls of migrating to HTTP/2.
The Matthew effect in software development: the worse the team, the less long-term planning it does, the older the technology it uses, the more ad-hoc the code it writes, and the more easily its excellent engineers leave (side-eyeing my own company).

From Yu Sheng's WeChat public account (yes, the guy who translated the regular expressions book). Many programmers write code in school and at work but don't know how to write programs well, and no school seems to teach "writing programs well"; it is far more than compiling, passing the tests, and conforming to the coding style. The "sense of honor" the article mentions really matters: when I write code I try to be responsible for every line, and seeing someone's code quality in review quietly shapes my opinion of that person and how I treat them.

Quite long: an explanation of how memory works from a hardware-leaning angle. The first few chapters can be found in any OS textbook, but the later material is much closer to reality and fun to read.

Remember the famous GoogleTechTalk episode "How To Design A Good API and Why it Matters"? This manual can be read as that talk's companion reading. API design is no less important to programming than architecture; these best practices are good for body and soul~~

An old article introducing Instagram's early (2012) architecture. I love this kind of early-architecture piece: watching others solve problems quick-and-dirty during hypergrowth. A small part of the stack looks dated today, but most of it is still worth borrowing: gunicorn and Fabric, staples of Python web development (plus supervisor); vmtouch (which turns out to be a super-lightweight memory data management tool with very well-written code); Munin and Pingdom for monitoring; Sentry for error reporting.
No especially good articles in my RSS this week, so I dug a few out of old notes to make up the numbers.

Einstein's side project while working at the patent office was his relativity papers; school janitor Pollock (yes, the paint-splattering modern artist) had painting as his side project; Slack began as a game company's side-project chat tool. Side projects genuinely and effectively advance one's craft (I feel this deeply). As the saying goes, these days in the Bay Area you're embarrassed to say hello without a side project or two.

A bottom-of-the-trunk classic: a long article by the well-known blogger Matt Might, laying out in detail the knowledge a competent CS student should learn and master. Every CS student should read it.

Tech blogs mostly discuss how to learn new technology, design architectures, and find jobs, but rarely how to do code review properly. Code review matters a great deal at work: it is an effective way to learn and share knowledge and to improve office relationships. This series discusses it in detail; as a bonus, here is Code Review Best Practices.

A non-technical article I really like. Starting from a customer's bizarre complaint that "my new car is allergic to vanilla ice cream", it analyzes explicit causal chains and the seemingly absurd yet rational grounds behind them. The original link is dead; the link given is a repost on another site.
On my girlfriend's suggestion, I'm starting a weekly post: short comments on the tech blogs and articles I read, collected goodies, book notes, and various fun things.

I have always found the O'Reilly-series diagrams clean and elegant and hoped to use the style in blog posts and presentations. Today I finally looked it up: the typeface is Myriad and the code font is Ubuntu Mono. I drew a few simple shapes in Keynote for future use.

At the end of every interview the interviewer asks "Do you have any questions?", and most people only ask questions that scratch the surface. This article digs deeper: how to turn your questions to the interviewer into a plus, and which questions reveal a company's development process and engineering culture (which bears on whether it's worth joining). Essential interview questions (Victoria has translated it into Chinese).

Cron is usually used to run a script at intervals, but most people only use it to check restarts or do backups. This article introduces some advanced Cron usage: reporting exit codes, sending mail, setting timeouts, and so on. BTW, the author also has some Linux Crypto articles worth a read.

Introduces a pile of performance monitoring & tuning tools, from a Netflix senior performance architect. I stumbled onto his blog while looking up systemtap and found the articles excellent; then, discussing them with our company's performance engineer, I learned the blogger was his old colleague at Sun … …

This table of storage-performance numbers has a long history, and this gist and its comments contribute many related documents and videos, a lot of them quite interesting. A few common numbers are well worth memorizing (e.g. the multiplicative relationships between CPU, memory, disk, and SSD read speeds); they help beginners appreciate the importance of locality.

Introduces some domestic (Chinese) case studies; the links at the end are also quite good.

Yunfeng's articles, like his code, are known for their high signal-to-noise ratio. This one discusses scoring algorithms for games; the Elo it covers is the formula Eduardo writes on the window in The Social Network.

As ever, a new year calls for a plan, whether or not it comes true.
I recently saw an interesting interview question: given a word and a dictionary, find all words in the dictionary whose edit distance from the given word is at most k.

A common approach is a single pass computing the edit distance between the word and every dictionary entry. Edit distance is a two-dimensional DP with time complexity $O(L^2)$, where L is the average word length, so the total is $O(NL^2)$, with N the number of words in the dictionary.

The problem with this method: once there are many queries, performance gets terrible. Following Lee Shellay's answer on Zhihu, we can solve this by building a Trie and combining it with DFS.
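For reference, the baseline $O(L^2)$ edit-distance DP that the single-pass approach would run against every dictionary entry (a standard Levenshtein implementation, not code from the original post):

```python
def edit_distance(a, b):
    # Classic Levenshtein DP: d[i][j] = distance between a[:i] and b[:j].
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("hello", "helo"))  # 1
```

Running this against every dictionary word is exactly the $O(NL^2)$ cost the trie approach avoids.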
So the algorithm itself is not hard. Define `check_fuzzy(trie, word, path, tol)`: `trie` is the node we have currently reached in the tree, `word` is the part of the query word still to be processed, `path` is the dictionary-word prefix recorded so far, and `tol` is the remaining edit-distance tolerance. Then keep a set, keep adding candidate words to it until the search finishes, and terminate when `tol` is used up (why not terminate when `word` is empty? Because remaining tolerance can still be spent appending suffixes). The final code:

```python
# Based on Lee Shellay's code http://www.zhihu.com/question/29592463
END = '$'

def make_trie(words):
    trie = {}
    for word in words:
        t = trie
        for c in word:
            if c not in t:
                t[c] = {}
            t = t[c]
        t[END] = {}
    return trie

def check_fuzzy_v4(trie, word, path='', tol=1):
    if tol < 0:
        return set()
    ps = set()
    if word == '':
        if END in trie:
            ps = {path}
    for k in trie:
        # match the current char, or count a substitution
        ps |= check_fuzzy_v4(trie[k], word[1:], path + k, tol - (not word or k != word[0]))
        # insert an extra char
        ps |= check_fuzzy_v4(trie[k], word, path + k, tol - 1)
    # delete one char (if word is empty, word[1:] will not raise)
    ps |= check_fuzzy_v4(trie, word[1:], path, tol - 1)
    return ps

if __name__ == '__main__':
    words = ['hello', 'hela', 'hel', 'dokm', 'i', 'ke', 'ik']
    t = make_trie(words)
    print check_fuzzy_v4(t, 'helo', '', tol=2)
```
Now try some bigger data. We know /usr/share/dict/words holds the spell-checking word list, 2.4 MB with 235886 words in total (at least on my Mac). We can build the dictionary from it with `cat /usr/share/dict/words > ./words.txt`, then mangle a sentence beyond recognition and run the code on it:

```
def test():
```

The results are pretty fast too:

And that's it, meow~

PS: Lee Shellay has updated his answer again, improving performance and accuracy; his code is better than mine, so go take a look.
As the name implies, a coroutine is a co-operative routine. It allows you to suspend and resume execution at different locations, so it is essentially just context switching. Not surprisingly, coroutines are implemented at the low level with primitives like setjmp/longjmp or ucontext.

In many scenarios, coroutines are a lighter-weight alternative to threads. For programming languages with a GIL (like Python), coroutines are often used to handle concurrency.

Let's take a look at the classic "producer-consumer" problem. One coroutine produces products and adds them to a queue; the other coroutine takes products from the queue and uses them (hmm, sounds like video buffering, right?).
The code below assumes you already have some knowledge of generators.

```python
import time
import random

def coroutine(func):
    # A wrapper to convert a generator function into a primed coroutine
    # From David Beazley
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        cr.next()
        return cr
    return start

def producer(target):
    while True:
        time.sleep(1)
        data = random.randint(1, 10)
        print("# producer: sending data {}".format(data))
        target.send(data)

@coroutine
def consumer():
    while True:
        data = yield
        print("# consumer: receiving data {}".format(data))

if __name__ == '__main__':
    g_consumer = consumer()
    producer(g_consumer)
```
Simple enough: `send()` is a built-in method of generators. The producer sends data to the consumer, and the consumer receives the data from `yield`.

Yes, the famous concurrency library gevent is based on coroutines.
PEP 342: Coroutines via Enhanced Generators
General concepts: concurrency, parallelism, threads and processes
Many people believe that decorators are one of the obscure concepts in Python. Trust me, they are not. In short, a decorator is a function that modifies other functions via closures.

There are plenty of detailed articles about what a decorator is, so there is no need to write one more. If you are not familiar with them, you may want to check these:

In this article, I am going to use a simple but interesting example to show one in action. OK, let's rock.
There is a set of escape sequences used to change the color of text. So if we want to colorize a sentence, we just need to put it between a color escape sequence and the reset escape sequence. For example:

```python
>> ORANGE = '\033[33m'
>> RED = '\033[31m'
>> GREEN = '\033[32m'
>> BLUE = '\033[34m'
>> RESET = '\033[0m'
>> print ORANGE + "Chinese New Year" + RESET
>> print GREEN + "Chinese" + GREEN + "New" + BLUE + "Year" + RESET
```
You will see "Chinese New Year" printed in those colors.

As I said, Python does not support switch/case, so we cannot switch on the color name to pick the corresponding escape sequence. Fortunately, a dictionary will do the work.

```python
def getColor(color):
    return {
        'black' : '\033[30m',
        'red' : '\033[31m',
        'green' : '\033[32m',
        'orange' : '\033[33m',
        'blue' : '\033[34m',
        'purple' : '\033[35m',
        'cyan' : '\033[36m',
        'light_grey' : '\033[37m'
    }.get(color, "")
```

The trick is the dictionary's built-in `get` method: the first parameter is the key, and the second, optional one is the default. As the docstring shows:

D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.
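A tiny standalone illustration of this dict-as-switch idiom (a made-up example, not from the original post):

```python
def http_status_class(code):
    # Map the leading digit of a status code to a label; .get's second
    # argument is the fallback, playing the role of switch's "default".
    return {
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }.get(code // 100, "unknown")

print(http_status_class(404))  # client error
print(http_status_class(999))  # unknown
```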
Suppose we have a function that performs some task. It can return three kinds of strings: if the task completes successfully it returns "SUCCESS: blah blah …"; if the task finishes but we cannot ensure its correctness it returns "WARNING: blah blah …"; if the task fails it returns "ERROR: blah blah …". How do we colorize these return strings?
Raw code:

```python
def colorize(*args):
    def getColor(color):
        return {
            'black' : '\033[30m',
            'red' : '\033[31m',
            'green' : '\033[32m',
            'orange' : '\033[33m',
            'blue' : '\033[34m',
            'purple' : '\033[35m',
            'cyan' : '\033[36m',
            'light_grey' : '\033[37m'
        }.get(color, "")

    def _colorize(func):
        def wrapper():
            RESET = '\033[0m'
            text = func()
            #if not isinstance(text, basestring):
            #    text = str(text)
            level = text.split(':', 1)[0]
            color = {
                'SUCCESS': args[0],
                'WARNING': args[1],
                'ERROR': args[2]
            }.get(level, '')
            return "{0}{1}{2}".format(getColor(color), text, RESET)
        return wrapper
    return _colorize

def do_task():
    # working working ....
    if success:
        return "SUCCESS: Yeah~~"
    elif warning:
        return "WARNING: wait, what?"
    else:
        return "ERROR: something went wrong here."
```
As you can tell, to make a decorator accept parameters, we need to wrap it in one more enclosing function.

Decorators are often used for caching, profiling, logging, synchronization (acquiring and releasing locks), and so forth. One of my favourite libraries, Click, is also a wonderful example.
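For instance, a minimal caching decorator (my own sketch, unrelated to Click's internals):

```python
import functools

def memoize(func):
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        # Compute each distinct argument tuple only once.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, fast thanks to the cache
```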
Happy Chinese New Year ~
In most linked-list problems we habitually create an empty dummy node pointing at the list head, to make it convenient to operate on the first node (for example, to delete it), and at the end we return dummy.next. The more scrupulous among us call delete when removing nodes to avoid memory leaks, but has anyone ever considered the dummy node's feelings?

A pointer to a pointer solves this problem elegantly.

A simple example 🌰:

```cpp
/* Remove all nodes holding a given value, with no dummy node. */
void remove_val(node **head, int val) {
    node **p = head;
    while (*p != NULL) {
        if ((*p)->val == val) {
            node *tmp = *p;
            *p = (*p)->next;   /* relink first; see the caveat on delete below */
            delete tmp;
        } else {
            p = &(*p)->next;
        }
    }
}
```
Let me try to explain. `p` is a two-level pointer: at the start, `p` points at the pointer that points at `head` (that is, `(*p)` points at head). The benefit is that when we need to delete the pointed-to node at some moment (the `delete *p` operation), p itself is unaffected (of course, the memory the pointed-to pointer corresponds to has been freed). The only inconvenience is that every other move has to go through `(*p)` (p calmly watches the pointer it points at move along).
Unlike the dummy solution, where advancing the pointer is `ptr = ptr->next;`, what about here? `(*p) = (*p)->next;`? That is wrong: in `1->2`, moving from `1` to `2` that way would modify node `1` itself. What must move is p, namely `p = &(*p)->next`, where `->` has higher precedence than `&`: p is assigned the address of `(*p)->next`, so now `(*p)` points at the old `(*p)`'s next.
Another question you may care about: after `delete (*p)`, how come the predecessor node's next does not get lost? This comes down to the essence of `delete` (Stack Overflow has a good answer on this). When we call `delete`, the data in that memory does not actually disappear; the address is merely marked reusable, and only when the program later calls `new` might the data there be overwritten. Like love: without a new one laid on top, how would you ever forget the old (still editing my blog on Valentine's Day, sigh~~). So strictly speaking this code carries a risk: if at the instant of the delete another program/thread happens to `new` a block and it happens to be this one, the approach breaks. The fix is to save the node in a temporary variable before the delete, overwrite the current pointer with next, and then delete the temporary.
While revising this I found Chen Hao has also written a similar article; Linus cites this trick as an example of what real core low-level coding, by someone who truly understands pointers, looks like. His article has diagrams, so if my account still has not made it clear, I recommend reading it.

PS: Chapter 12 of Pointers On C, "Using Structures and Pointers", also explains linked-list pointer operations in detail.
You see, the programmer world rarely reaches such heights: grab a random programmer in Zhongguancun and ask why he programs, and he will never answer "to establish a heart for the company, a mission for open source, and peace for the community". Of course, some ambitious young people (like me) do wrestle with the question now and then. Lately, for instance, I have been agonizing over my ambition outstripping my talent. I'm long immune to the snake oil of success studies; I need some real data for motivation.

First: what does an excellent technical master look like? I went through the profiles of every Chinese tech blogger in my RSS subscriptions and tallied some interesting things:

So I am not entirely without a chance. The reason I love this industry is exactly that it has so many smart people and so many hardworking people; that gives me pressure, and that pressure feels great.

As in Lermontov's little poem "The Sail":
Amid the deep blue haze of the sea,
a lonely sail gleams white.
What does it seek in that distant land?
What has it left behind at home?

The waves surge and the sea wind howls,
the mast bends and creaks.
No, it is not happiness it seeks,
nor is it from happiness it flees!

Below it, a current of clear azure;
above it, the golden sun.
Yet restless, it begs for the storm,
as if in storms alone there were peace.
A whole month of 2015 is already gone. Starting on new-year plans now is a bit late, but late beats never.

Planning a full year is hard, so I will plan just a small part first, each item with a deadline (deadline-driven development is the most productive kind).

That's all I can think of for now. Come on, wilbeibi!
OpenSSH encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other attacks. Additionally, OpenSSH provides secure tunneling capabilities and several authentication methods, and supports all SSH protocol versions.
And I believe `ssh` is one of the most-used commands for programmers (Windows users, you have PuTTY, which is … not bad). In this post I am going to list some of the most basic usages of `ssh`.
There are two ways to identify users: via password, or via key pair. The latter is more secure. We can generate a key pair with:

```
$ ssh-keygen -t rsa -C "your_email@example.com"
# Creates a new ssh key, using the provided email as a label
Generating public/private rsa key pair.
# Enter file in which to save the key (/Users/you/.ssh/id_rsa): [Press enter]
```

where `-t` stands for the encryption type and `-C` for a comment. Then choose a strong passphrase (in case your RSA keys are stolen). Now you will see id_rsa (the private key) and id_rsa.pub (the public key) in your `~/.ssh/` directory (don't let others see your private key).
At last, add your key to `ssh-agent` (a key-management tool):

```
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
```

Now it's time to use our public key. For Linux users:

```
ssh-copy-id user@machine
```

For Mac users, either `brew install ssh-copy-id` to get the same command, or type:

```
cat ~/.ssh/id_rsa.pub | ssh user@machine "mkdir ~/.ssh; cat >> ~/.ssh/authorized_keys"
```

As you can tell, all we are really doing here is copying the content of id_rsa.pub into the server's `~/.ssh/authorized_keys`.
Laziness is a great virtue of a programmer. Add these to your local `~/.ssh/config` (create it if it does not exist) to simplify your life:

```
Host matrix
    HostName <domain name or public IP>
    User <user name>
    IdentityFile </path/to/private_key>
```

One more thing: ssh config supports wildcards, so you can use `Host *compute-1.amazonaws.com` for all your EC2 instances. I also added

```
TCPKeepAlive=yes
ServerAliveInterval=15
ServerAliveCountMax=6
StrictHostKeyChecking=no
Compression=yes
ForwardAgent=yes
RSAAuthentication=yes
PasswordAuthentication=yes
```

to my config file for some extra features.
Use `uname` to get system information:

```
uname -a
```

`-a` prints all the information.

`nproc` prints the number of processing units available (GNU coreutils):

```
$ nproc
4
```
`lscpu` displays CPU architecture information (util-linux). `df` is a powerful command for displaying system disks:

```
df -h /path/to/directory
```

`cat /proc/partitions` and `cat /proc/mounts` are also pretty handy ways to check the partitions and mounted disks. Just as with disks, `cat /proc/meminfo` easily checks memory information (thanks to Unix's "everything is a file" design concept).
Alternatively, you can type `free -m`, which is essentially the same as checking /proc/meminfo; `-m` displays megabytes (and, as you'd expect, -g gigabytes, -k kilobytes).

The `last` command displays user info such as terminal, time, and date; to check one specific user's activity, `last username` is what you are looking for. `w` is a great but rarely known command: it displays who is logged on and what they are doing, showing the username, terminal, source IP, login time, idle time, JCPU, and the command line of the current process. If you have never heard of it, I strongly suggest giving it a try.
uptime
: Tell how long the system has been running.
`ps`: a well-known command for checking current processes. For instance, to list all zombie processes:

```
ps aux | awk '{ print $8 " " $2 }' | grep -w Z
```

`ps aux` shows processes for all users, including the owning user and processes not attached to a terminal (check the man page for details); awk filters out the STAT and PID fields; grep selects the lines containing Z (zombie). Now we have the zombie PIDs, and it's easy to kill them with `kill -9 {PID}`.
`top`/`htop`: better not to use non-builtin commands (for security reasons), but if you do want to, `htop` is a superior alternative to `top`, dynamically displaying the current tasks.

To get your own public IP, `curl icanhazip.com` and `curl ifconfig.me` are both easy ways (the former is much faster).
`ping`: even my mother knows to use `ping` to check network connectivity.

`ifconfig`: a frequently used tool to view network interface information. BTW, I wrote a script to filter IPs, MAC addresses, and networks out of `ifconfig` output (tested on Ubuntu, Fedora, OmniOS and FreeBSD).
`lsof`, aka list open files, is definitely a Swiss Army knife for analyzing the network. `lsof -i` lists all open Internet and X.25 network files. (The examples below are from Daniel Miessler's blog, see the reference.)

```
lsof -iTCP               # Show only TCP connections
lsof -i:80               # Show networking related only to port 80
lsof -i@107.170.181.47   # Show connections with a particular IP
lsof -u username         # Show the given user's connections
lsof -u ^username        # Show all connections except the given user's
```

`ss -s`: display currently established, closed, orphaned and waiting TCP sockets.

By useradd:
```
useradd -m -d <HomeDir> -g <Group> username
```

It's optional to specify the new user's home directory and group, but I strongly suggest doing so: `-m` creates the home directory and `-d` sets its path. (Warning: don't mix up `useradd` and `adduser`; the latter is a higher-level implementation. Here is a detailed explanation of the differences between the two.)

By groupadd:

```
groupadd groupname
```
By usermod:

```
usermod -a -G
```

where `usermod` means modify a user account and `-a` stands for append: append this user to a group.

Well, there is no such built-in command, but we can use:

```
grep '^groupname' /etc/group
```

or `apt-get install members` and then:

```
members groupname
```
Sticky bit is used for directories. As wikipedia said:
When the sticky bit is set, only the item’s owner, the directory’s owner, or root can rename or delete files. Without the sticky bit set, any user with write and execute permissions for the directory can rename or delete contained files, regardless of owner.
For example, if the professor create a /homework directory with sticky bit, every student can upload their homework, but they cannot rename or delete other students’ homework.
```
chmod +t /path/to/directory
```

or

```
chmod 1755
```

where 1 stands for the sticky bit, 7 gives the owner all privileges, and 5 gives read and execute to the group and to others.
Now, /path/to/directory should look like this (note the last character):
drwxr-xr-t 1 root other 0 Nov 10 12:57 test
As wikipedia said, if the sticky-bit is set on the directory without the execution bit set for the others category, it is indicated with a capital T:
drwxr-xr-T 1 root other 0 Nov 10 12:57 test
One-sentence explanation: regardless of who runs this program, it runs as the user who owns it, not the user who executes it.

```
chmod u+s /path/to/file
```

For instance, a simple shell script `showfile.sh` with the setuid bit set and owned by root:

```
#!/bin/sh
# showfile
ls -l | sort
```
And if I were a bad guy, I could easily write the script:

```
rm -rf /some/where/important
```

save it under the name `ls`, and add my `ls` to the front of $PATH. Now when showfile.sh runs: boom! The files are deleted.
If you found grammar errors or typos, please feel free to help me correct it.
```
git add file
git commit -m "Aha, file modified"
```

Or just type

```
git commit -am "Aha, file modified"
```

After that, push to the remote repository:

```
git push origin branch_name
```

So, what's the difference between these two forms? I will get to that later.
It’s a good practice to fix a wrong commit rather than make a new commit.
So, first, edit the file with the problem, make the corrections, then:
```
git add now_right_file
git commit --amend
git push --force branch_name   # Warning!
```

Be careful: `--force` is dangerous. It has worked fine for me in 99% of cases, but it does have the potential for harm, which is why Linus doesn't recommend it.
There are two ways to delete files: delete locally and commit to the remote repository, or directly delete files in the remote repository, like:

```
git rm --cached file_to_delete
```

Even better, you can delete all the files matching a certain glob:

```
git rm --cached 'merge-*'   # delete all the files starting with "merge-"
```
There is already an excellent and widely accepted answer on StackOverflow, which explains it far better than I could –> link:

In the simplest terms, `git pull` does a `git fetch` followed by a `git merge`.

You can do a `git fetch` at any time to update your remote-tracking branches under `refs/remotes/<remote>/`. This operation never changes any of your own local branches under `refs/heads`, and is safe to do without changing your working copy. I have even heard of people running git fetch periodically in a cron job in the background (although I wouldn't recommend doing this).

A `git pull` is what you would do to bring a local branch up-to-date with its remote version, while also updating your other remote-tracking branches.
```
git pull     # will auto-merge the unconflicted parts
git status   # check the information of the conflicted files
```

Use your favorite editor to edit the conflicted files (marked with "<<<<<<" and ">>>>>>"), save, commit. That's all.
```
git checkout latest_branch
git merge -s ours to_overwrite_branch
```

What does `ours` mean here? It's a merge strategy; you can find it in the git checkout doc:

git checkout [--ours|--theirs] branch
When checking out paths from the index, check out stage #2 (ours) or #3 (theirs) for unmerged paths.
The index may contain unmerged entries because of a previous failed merge. By default, if you try to check out such an entry from the index, the checkout operation will fail and nothing will be checked out. Using -f will ignore these unmerged entries. The contents from a specific side of the merge can be checked out of the index by using –ours or –theirs. With -m, changes made to the working tree file can be discarded to re-create the original conflicted merge result.
```
git branch -d died_branch
git push origin --delete died_branch   # or: git push origin :died_branch
```
```
git reflog show         # find the revision hash
git checkout revision_hash .
```

Let me explain a little: `git reflog show` gives us a list of all the commits and their hashes; then we check out that specific hash.
Read more:
Many people ask how to combine `git add` and `git commit` into one command, and the most common answer is `git commit -a -m "blah blah"`.

Yes and no. For files that have been `git add`-ed before, `git commit -a` will do the `git add` for you. But for the rest (aka untracked files), we still have to `git add` them ourselves. If you really want to save time on this tedious work, an alias is what you are looking for.
In .gitignore:

```
# Ignore everything
*
!except_script.sh
```

This will ignore everything except except_script.sh.
Once, my silly cat danced on my keyboard after a commit and messed up all the files!

Luckily, we can use

```
git reset --hard HEAD^
```

to revert to the previous commit.
Or, if I wrongly `git add should_not_add_file`, we can use

```
git reset HEAD should_not_add_file
```

to unstage that file.
Stolen from Stackoverflow again
git clone -b <branch> <remote_repo>
Example:
git clone -b my-branch git@github.com:user/myproject.git
Alternative (no public key setup needed):
git clone -b my-branch https://git@github.com/username/myproject.git
No, that's a little heavy for this project. So, what are the alternatives? Dropbox! Dropbox may be the easiest way to share a folder (wait, you mean `rsync`? Dropbox has done a lot of algorithmic work to make its syncing faster).

But then something weird happened: my WebStorm automatically changed its layout views. That's because he also uses WebStorm, and for each project WebStorm keeps project-specific settings in a `.idea/` directory (as the document below says).
Project settings are stored with each specific project as a set of xml files under the .idea folder. If you specify the default project settings, these settings will be automatically used for each newly created project.
A `.gitignore`-like file in Dropbox? Sure, though of course it's not as powerful as `.gitignore`: in Dropbox -> Preferences -> Account -> Change Settings, uncheck the `.idea` folder. That's all.
I also strongly suggest unchecking the `node_modules` folder; it takes Dropbox far too long to synchronize such a pile of small files.
And if you sometimes use Emacs, to avoid the annoying temporary files (which, to be fair, sometimes really save your ass), the only way I know is to add this to your `.emacs` file:

```
(setq make-backup-files nil)
```
Please feel free to correct my typos or grammar.
There are already some comparisons of the pros and cons of each library. As the lxml documentation says:
BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.
… …
The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.
In short: lxml is faster when parsing well-formed web page.
This is a common scenario. First, get the links of each entry on an index page.

For example, find all housing posts on http://losangeles.craigslist.org/hhh/index.html. In Chrome, use Inspect Element to copy the XPath of one link:

The XPath is `/*[@id="toc_rows"]/div[2]/p[1]/span[2]/a/@href`, running from p[1] to p[100]. Save these links to a file `crag_link.txt`.
```python
import sys

from lxml import html
import requests

with open('crag_link.txt', 'a') as f:
    for i in range(0, 1000, 100):
        pg = 'http://losangeles.craigslist.org/hhh/index' + str(i) + '.html'
        src = requests.get(pg)
        if src.status_code == 404:
            sys.exit(1)
        tree = html.fromstring(src.text)
        print 'Get page', i
        for j in range(1, 100 + 1):
            x_link = '//*[@id="toc_rows"]/div[2]/p[' + str(j) + ']/span[2]/a/@href'
            links = tree.xpath(x_link)
            for ln in links:
                f.write('http://losangeles.craigslist.org' + ln + '\n')
```
Click into one of the pages. For instance, to get the post id, copy its XPath, e.g. `//*[@id="pagecontainer"]/section/section[2]/div[2]/p[1]`. Per XPath syntax, appending `/text()` to such a path yields the text we need.
```python
try:
    post_id = tree.xpath('//*[@id="pagecontainer"]/section/section[2]/div[2]/p[1]/text()')
except Exception:
    pass  # Handle error
```

The reason for the try/except block is to survive missing data. Wait a second: if we have 30 attributes to scrape, do we need to write try/except 30 times? Definitely not; wrapping it in a function is a good idea. BTW, hardcoding XPaths in the program is not a good idea either: with a function we can pass the XPath as a parameter (or even better, store attribute names and XPaths in a dictionary).
```python
def get_attr(tree, xps):
    return tree.xpath(xps)

# xps_dict looks like:
# {'post_id': '//*<something>/p[1]/text()',
#  'post_time': '//*<something>/p[1]/text()'}
for a, x in xps_dict.iteritems():
    attr[a] = get_attr(tree, x)
```

In Part 2 I will carry on, covering encoding problems, duplicate prevention, and so forth.