
Python Stop Word List: Updating the Hot Word List

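This article shows how to use a stop word list in Python and how to refresh a hot word list from the latest text. Stop words can distort the results of text analysis, so they need to be excluded before word frequencies are counted.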


1. Obtain the stop word list

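First, download a Chinese stop word list from the web and save it locally. jieba itself does not ship a Chinese stop word list (its keyword extractor only carries a small default set of English words), so the code below assumes the downloaded list is stored one word per line in a file named stopwords.txt; the file name is an assumption.

import jieba
# Load the stop word list (one word per line; the file name is an assumption)
with open('stopwords.txt', 'r', encoding='utf8') as f:
    stopwords = {line.strip() for line in f if line.strip()}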
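
Stop word lists found online usually come as plain text files, and it is common to combine several of them (for example a Chinese list and an English one). The following is a minimal sketch of such a loader, assuming UTF-8 files with one word per line. The function name load_stopwords, its extra parameter, and any file names passed to it are hypothetical and not part of jieba.

```python
from pathlib import Path


def load_stopwords(*paths, extra=()):
    """Merge one or more stop word files (one word per line) into a set.

    `extra` holds additional words to add by hand (hypothetical parameter).
    """
    stopwords = set(extra)
    for path in paths:
        text = Path(path).read_text(encoding="utf8")
        for line in text.splitlines():
            word = line.strip()
            if word:  # skip blank lines
                stopwords.add(word)
    return stopwords
```

A call such as load_stopwords('cn_stopwords.txt', 'en_stopwords.txt', extra={'的'}) would then return a single set covering both files; the two file names here are placeholders.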

2. Read the text data

Next, read the text to be analyzed. Here we assume it is stored in a file named text_data.txt.

with open('text_data.txt', 'r', encoding='utf8') as f:
    text = f.read()

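3. Segment the text and remove stop words

Use the jieba library to segment the text into words, then drop every word that appears in the stop word list.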

import jieba.posseg as pseg
# Segment the text, then drop stop words and whitespace-only tokens
words = [word for word, flag in pseg.cut(text) if word.strip() and word not in stopwords]
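
Segmenters also emit punctuation and whitespace as tokens, which would otherwise end up in the frequency counts. The sketch below shows one way to filter them out together with the stop words. It works on any list of tokens (for instance the output of jieba.lcut(text)); the helper names is_noise and filter_tokens are illustrative assumptions, not jieba functions.

```python
import unicodedata


def is_noise(token):
    """True for tokens that are empty, whitespace, or only punctuation/symbols."""
    stripped = token.strip()
    if not stripped:
        return True
    # Unicode general categories starting with P (punctuation) or S (symbol)
    return all(unicodedata.category(ch)[0] in "PS" for ch in stripped)


def filter_tokens(tokens, stopwords):
    """Keep tokens that are neither stop words nor noise."""
    return [t for t in tokens if t not in stopwords and not is_noise(t)]


tokens = ["人工智能", "的", "，", "发展", " ", "很", "快", "。"]
print(filter_tokens(tokens, {"的", "很"}))  # ['人工智能', '发展', '快']
```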

4. Count word frequencies

Use the Counter class from the collections module to count how often each word occurs.

from collections import Counter
# Count word frequencies
word_freq = Counter(words)
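
As a small worked example of what Counter produces, the sketch below counts a hand-written token list. The sample words are made up for illustration only.

```python
from collections import Counter

words = ["疫情", "芯片", "疫情", "5G", "芯片", "疫情"]
word_freq = Counter(words)

print(word_freq["疫情"])          # 3
print(word_freq["区块链"])        # 0, missing keys count as zero
print(word_freq.most_common(2))   # [('疫情', 3), ('芯片', 2)]
```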

5. Update the hot word list

Sort the word frequencies in descending order and take the top N words as the hot words.

# Update the hot word list
N = 20  # number of hot words to keep (an assumed value; adjust as needed)
hotwords = word_freq.most_common(N)
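
In practice the raw top N often still contains single-character words or words that occur only once. The sketch below adds two simple thresholds before taking the top N. The function name top_hotwords and the default values of min_count and min_len are assumptions chosen for illustration.

```python
from collections import Counter


def top_hotwords(word_freq, n=10, min_count=2, min_len=2):
    """Return up to n (word, count) pairs, most frequent first.

    Words rarer than min_count or shorter than min_len are ignored.
    """
    candidates = Counter({
        w: c for w, c in word_freq.items()
        if c >= min_count and len(w) >= min_len
    })
    return candidates.most_common(n)
```

For example, with counts {"疫情": 5, "的": 9, "芯片": 1, "云计算": 3} and n=2, the single-character "的" and the one-off "芯片" are dropped, leaving "疫情" and "云计算".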

6. Output the hot word list

Write the hot word list to a file.

# Write the hot word list to a file, one "word: count" pair per line
with open('hotwords.txt', 'w', encoding='utf8') as f:
    for word, freq in hotwords:
        f.write(f'{word}: {freq}\n')
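
If the hot word list should be updated over time rather than rebuilt from scratch, the saved file can be read back and merged with fresh counts. The sketch below assumes the "word: count" line format written above and assumes that old and new counts are simply added together; that merge policy, and the names parse_hotwords and merge_hotwords, are illustrative choices rather than anything the steps above prescribe.

```python
from collections import Counter


def parse_hotwords(lines):
    """Parse 'word: count' lines into a Counter, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        word, sep, freq = line.strip().rpartition(": ")
        if sep and word and freq.isdecimal():
            counts[word] += int(freq)
    return counts


def merge_hotwords(old_lines, new_counts, n):
    """Add new counts to previously saved ones and return the top n pairs."""
    merged = parse_hotwords(old_lines)
    merged.update(new_counts)  # Counter.update adds counts instead of replacing
    return merged.most_common(n)
```

The old_lines argument can be any iterable of strings, for instance an open file object for hotwords.txt.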

At this point, we have filtered the text with the stop word list and produced an updated hot word list.

The following is a simple two-column table: the first column lists stop words and the second lists hot words.

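Stop word    Hot word
a            新冠干扰 (COVID-19 interference)
about        疫情 (epidemic)
above        云计算 (cloud computing)
after        5G
again        人工智能 (artificial intelligence)
all          大数据 (big data)
almost       区块链 (blockchain)
along        芯片 (chips)
also         无人驾驶 (autonomous driving)
always       虚拟现实 (virtual reality)
among        生物技术 (biotechnology)
an           量子计算 (quantum computing)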
and
any
are
as
at
be
because
been
before
being
below
between
both
but
by
can
could
did
do
does
doing
down
during
each
few
for
from
further
had
has
have
having
he
her
here
hers
herself
him
himself
his
how
however
i
if
in
into
is
it
its
itself
just
kg
km
lb
left
like
ln
ltd
m
mg
might
ml
mm
more
most
mr
mrs
ms
much
must
my
myself
n
no
nor
not
of
off
often
on
once
only
or
other
our
ours
ourselves
out
over
own
part
per
perhaps
put
rather
re
s
same
she
should
since
so
some
such
t
than
that
the
their
theirs
them
themselves
then
there
these
they
thick
thin
this
those
through
to
too
under
until
up
very
was
we
well
were
what
when
where
which
while
who
whom
why
with
within
without
would
yet
you
your
yours
yourself
yourselves

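Note that the stop word list above is in English while the hot word list is in Chinese; the table is only an example. In practice, the contents of both lists can be adjusted to your needs. A stop word list usually contains common words that carry little meaning on their own, while a hot word list contains currently trending topics or keywords.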
