Bash Programming: Counting the Frequency of Each Word in a Text File

The shell on Linux is a powerful programming tool (and environment). Here is an example: the following problem appears on the coding site LeetCode.

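Write a bash script to calculate the frequency of each word in a text file, words.txt.

For simplicity, you may assume:

words.txt contains only lowercase characters and space ' ' characters.
Each word consists of lowercase characters only.
Words are separated by one or more whitespace characters.

For example, assume that words.txt has the following content: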

the day is sunny the the
the sunny is is
Your script should output the following, sorted by descending frequency:
the 4
is 3
sunny 2
day 1
Note:
Don't worry about handling ties; it is guaranteed that each word's frequency count is unique.

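Of course, you could write a script of several lines entirely in Bash. In practice, though, chaining a few commands with pipes, so that each one consumes the output of the previous one, solves the problem in a single line.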
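For comparison, a multi-line script of that kind might look like the sketch below. It assumes bash 4 or later (for associative arrays), and the function name word_freq is made up for illustration; only the final ordering is delegated to sort.

```shell
#!/usr/bin/env bash
# Sketch only: word_freq is a hypothetical helper, not part of the problem.
# Requires bash 4+ because it uses an associative array.
word_freq() {
  local file=$1 word
  local -a words=()
  local -A counts=()
  # read -a splits each line on whitespace, so repeated spaces never
  # produce empty words. The || part keeps a last line without a newline.
  while read -r -a words || (( ${#words[@]} > 0 )); do
    for word in "${words[@]}"; do
      counts[$word]=$(( ${counts[$word]:-0} + 1 ))
    done
  done < "$file"
  # Print "word count" pairs, then order them by count, highest first.
  for word in "${!counts[@]}"; do
    printf '%s %d\n' "$word" "${counts[$word]}"
  done | sort -k2,2nr
}
```

Calling word_freq words.txt on the sample file should print the same four lines as the expected output.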

Solution – cat, tr, awk, sort

cat words.txt | tr -s ' ' '\n' | awk '{nums[$1]++}END{for(word in nums) print word, nums[word]}' | sort -rn -k2

Solution – grep, sort, uniq, sort, awk

grep -oE '[a-z]+' words.txt | sort | uniq -c | sort -nr | awk '{print $2" "$1}'

Solution – sed, grep, sort, uniq, sort, awk

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort | uniq -c | sort -nr | awk '{print $2" "$1}'

Solution – awk and sort

awk '{words[$1]+=1} END{for(word in words){print word,words[word]}}' RS="[ \n]+" words.txt  | sort -nrk2

Solution – cat and awk

cat words.txt | awk '{for(i=1;i<=NF;++i) { arr[$i]++; } } END { x=0; for(var in arr) {newarr[arr[var]]=var; if(arr[var]>x) x=arr[var];} for(i=x;i>0;--i) if (i in newarr) print newarr[i] " "i; }'
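The awk one-liner above is dense, so here is a sketch of the same idea spread over several lines with comments. The wrapper name count_words_awk and the variable names are made up for readability. Note that inverting the map from word to frequency only works because the problem guarantees unique frequencies.

```shell
# Sketch only: count_words_awk is a hypothetical wrapper around awk.
count_words_awk() {
  awk '
    # Count every whitespace-separated field on every line.
    { for (i = 1; i <= NF; ++i) count[$i]++ }
    END {
      max = 0
      # Invert the map: frequency -> word. This is safe only because
      # the problem guarantees that no two words share a frequency.
      for (w in count) {
        by_freq[count[w]] = w
        if (count[w] > max) max = count[w]
      }
      # Walk the frequencies from highest to lowest, skipping unused ones.
      for (f = max; f > 0; --f)
        if (f in by_freq) print by_freq[f], f
    }
  ' "$1"
}
```

Because the END block walks the frequencies in descending order itself, this variant needs no separate sort step.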

Solution – tr, sort, uniq, sort, awk

tr -s ' ' '\n' < words.txt | sort | uniq -c | sort -nr | awk '{print $2, $1}'

Solution – sed

cat words.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -nr | sed -r -e 's/[[:space:]]*([[:digit:]]+)[[:space:]]*([[:alpha:]]+)/\2 \1/g'

There is a well-known saying on the Linux command line: Where there is a shell, there is a way.

Breaking Down the Commands

The solutions above are all quite similar. The most important first step is to split the file into individual words:

sed -r 's/\s+/\n/g' words.txt

Or:

cat words.txt | tr -s ' ' '\n'

Or:

grep -oE '[a-z]+' words.txt

Each of these commands prints the words one per line:

the
day
is
sunny
the
the
the
sunny
is
is
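The tokenizers agree on the sample file, but they can differ when a line starts with a space. The sketch below runs two of them on a made-up line with leading and repeated spaces; the variable names are chosen only for illustration.

```shell
# Sketch: compare two tokenizers on a line with leading and repeated
# spaces. The sample line is made up for illustration.
sample=' the  day is'

# tr turns the leading space into a newline, so the first line is empty.
tr_out=$(printf '%s\n' "$sample" | tr -s ' ' '\n')

# grep -o prints only the matched words, so no empty line can appear.
grep_out=$(printf '%s\n' "$sample" | grep -oE '[a-z]+')

printf 'tr:\n%s\n' "$tr_out"
printf 'grep:\n%s\n' "$grep_out"
```

This is why some of the pipelines need an extra step to discard empty lines while the grep -o variant does not.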

Next, grep -v "^$" drops any empty lines (-v inverts the match, so only non-empty lines pass through). Sorting the result then places identical words next to each other.

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort

Output:

day
is
is
is
sunny
sunny
the
the
the
the

Piping this through uniq -c then shows how many times each word occurs:

      1 day
      3 is
      2 sunny
      4 the
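The sort step matters because uniq only merges adjacent identical lines. A minimal sketch on a made-up three-word input shows the difference; the variable names are illustrative.

```shell
# Sketch: uniq collapses only adjacent duplicates, so unsorted input
# leaves the two occurrences of "the" in separate groups.
words=$(printf 'the\nis\nthe\n')

without_sort=$(printf '%s\n' "$words" | uniq -c | wc -l | tr -d ' ')
with_sort=$(printf '%s\n' "$words" | sort | uniq -c | wc -l | tr -d ' ')

echo "groups without sort: $without_sort"   # prints 3
echo "groups with sort: $with_sort"         # prints 2
```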

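To order the result by count, add one more pipe that sorts in reverse numeric order (sort -nr). Simply adding -r to the earlier sort is not enough, because that sort orders by word rather than by count.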

      4 the
      3 is
      2 sunny
      1 day
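A small sketch of why a numeric sort (-n) is the safer choice once counts have different numbers of digits. The two input lines are made up, and LC_ALL=C is set only to make the plain comparison byte-wise and reproducible.

```shell
# Sketch: plain reverse sort compares text, numeric reverse sort
# compares values. The counts 9 and 10 are made up for illustration.
lexical=$(printf '9 is\n10 the\n' | LC_ALL=C sort -r | head -n 1)
numeric=$(printf '9 is\n10 the\n' | LC_ALL=C sort -nr | head -n 1)

echo "sort -r puts first:  $lexical"   # prints "9 is"
echo "sort -nr puts first: $numeric"   # prints "10 the"
```

With uniq -c the counts are right-aligned with padding, which often hides the problem, but how leading blanks are compared depends on the locale, so -n avoids relying on that.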

Finally, pipe the result into awk, which splits each line into whitespace-separated columns, and print the two columns in swapped order:

awk '{print $2" "$1}'

Output:

the 4
is 3
sunny 2
day 1
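Putting the steps together, the whole pipeline can be wrapped in a small function and tried on the sample input. This is a sketch: the function name word_freq_pipeline and the temporary file are assumptions made for the demonstration.

```shell
# Sketch only: word_freq_pipeline is a hypothetical wrapper name.
word_freq_pipeline() {
  # split into words | group | count | order by count | swap columns
  grep -oE '[a-z]+' "$1" | sort | uniq -c | sort -k1,1nr | awk '{print $2, $1}'
}

# Recreate the sample words.txt content in a temporary file.
tmp=$(mktemp)
printf 'the day is sunny the the\nthe sunny is is\n' > "$tmp"
word_freq_pipeline "$tmp"
rm -f "$tmp"
```

On the sample content this prints the four expected lines, from "the 4" down to "day 1".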
