Python help!

sp3co · Jan 9, 2018

I need some help from the Python experts here.

Has anyone here done a project with the UCSC Sinhala Corpus and NLTK? I'm using the UCSC corpus for a project. I tried to open it the same way NLTK opens a custom corpus (because I want to use the NLTK functions), but I get a Unicode error like this:

Traceback (most recent call last):
File "/home/xxxx/PycharmProjects/testing/readcorpus.py", line 13, in <module>
file = read_file.read()
File "/home/xxxx/.virtualenvs/PycharmProjects/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This is my code. The readpath in it is correct, but the error comes up when it tries to read that file.

from nltk.corpus import PlaintextCorpusReader

corpus_root = './resources/corpus/UCSC-Sinhala-News-Corpus/UCSC-Sinhala-News-Corpus-V1'

sinhala_corpus = PlaintextCorpusReader(corpus_root, '.*')

print(sinhala_corpus.fileids())

readpath = './resources/corpus/UCSC-Sinhala-News-Corpus/UCSC-Sinhala-News-Corpus-V1/News Corpus_V1/NPED0001.TXT'

read_file = open(readpath, 'r', encoding='utf-8')
file = read_file.read()

Why does this happen, and how do I fix it?

When I open that text file in Notepad, its type is shown as Unicode. But once I save it as UTF-8, the error goes away and the file reads fine. How do I fix this without changing the file type?

I checked Stack Overflow for how to read a Unicode file. It says that doing it this way should work.

CloudX64 · Jan 9, 2018

sp3co · Jan 9, 2018

JokerFan said:
up

Thanks, man.

CloudX64 said:

Thanks, man.

Mr_Bee said:
BUMP

Thanks, man.

I don't have any rep to give you guys ...

Sorry.

owlX · Jan 10, 2018

Code:

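# -*- encoding: utf-8 -*-

# Convert a file of unknown encoding to UTF-8

import codecs
import subprocess

file_location = "jumper.sub"
# Ask the Unix `file` utility to guess the encoding
# (Python 3 replacement for the removed `commands` module)
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode().strip()

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')

# Copy the text line by line, re-encoding it as UTF-8
for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()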

Try this one. I had the same issue a year ago.

owlX · Jan 10, 2018

Code:

# -*- encoding: utf-8 -*-
import codecs
import re

def get_words(filepath):
    # Read the UTF-16 file and collapse every run of whitespace into one space
    with codecs.open(filepath, 'r', 'UTF-16') as f:
        return re.sub(r'\s+', ' ', f.read())

words_to_string = get_words('sample3.txt')
words_to_list = words_to_string.split()

word_dic = {}

# Collect each pair of adjacent words and count its occurrences in the text
# (str.count does a plain substring match)
for i in range(len(words_to_list) - 1):
    pair = words_to_list[i] + " " + words_to_list[i + 1]
    word_dic[pair] = words_to_string.count(pair)

# Print every pair with its count; the output is also appended to output.txt
with codecs.open("output.txt", "a", encoding="utf-8") as myfile:
    for pair, count in word_dic.items():
        myfile.write(pair + " : " + str(count) + "\n")
        print(pair + " : " + str(count))

Don't mind the code itself; see if there's anything in it you can use.

kolavari · Jan 10, 2018

Didn't you try defining the encoding at the very top?

sp3co · Jan 10, 2018

owlX said:

Try this one. I had the same issue a year ago.

owlX said:

Don't mind the code itself; see if there's anything in it you can use.

Thanks, machan. I'll take a look.

kolavari said:
Didn't you try defining the encoding at the very top?

Where do you mean by "at the very top"? The encoding is already defined where the file is opened.

sp3co · Jan 10, 2018

Guys, it works now. So "Unicode" means utf-16, right?

As soon as I set encoding = 'utf-16', it worked.
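
For anyone hitting the same traceback, here is a minimal sketch of why encoding = 'utf-16' fixes it. It assumes the corpus files were saved the way Notepad's "Unicode" option saves them, which is UTF-16 little-endian with a byte order mark (bytes FF FE). The sample text and the temporary file are made up for illustration and are not part of the corpus.

```python
import codecs
import os
import tempfile

# Made-up sample text standing in for a corpus file.
sample = "සිංහල පුවත්"

# Write it as UTF-16 LE with a BOM, so the file starts with bytes FF FE.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))

# Decoding as UTF-8 fails on the first BOM byte (0xff).
utf8_error = None
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding as UTF-16 consumes the BOM and returns the text.
with open(path, "r", encoding="utf-16") as f:
    text = f.read()

print(utf8_error)
print(text == sample)
```

NLTK's PlaintextCorpusReader also accepts an encoding keyword argument (UTF-8 by default), so passing encoding='utf-16' to it should let the NLTK functions read the files directly. That part is not exercised in the sketch.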

kolavari · Jan 11, 2018

sp3co said:
Guys, it works now. So "Unicode" means utf-16, right?

As soon as I set encoding = 'utf-16', it worked.

"Unicode" can mean UTF-8 or UTF-16.. it depends... :yes:
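
Since the encoding depends on the file, one way to avoid guessing is to look at the byte order mark before opening it. The sketch below is only an illustration: guess_encoding, the BOMS table and the temporary files are hypothetical names, and it assumes a file without a BOM is plain UTF-8, which will not always be true.

```python
import codecs
import os
import tempfile

# BOM signatures, UTF-32 first so FF FE 00 00 is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default

# Made-up sample files: one UTF-16 LE with a BOM, one plain UTF-8.
sample = "සිංහල පුවත්"
tmp_dir = tempfile.mkdtemp()
utf16_path = os.path.join(tmp_dir, "utf16.txt")
utf8_path = os.path.join(tmp_dir, "utf8.txt")
with open(utf16_path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))
with open(utf8_path, "wb") as f:
    f.write(sample.encode("utf-8"))

# Open each file with whatever encoding its BOM suggests.
results = {}
for p in (utf16_path, utf8_path):
    enc = guess_encoding(p)
    with open(p, "r", encoding=enc) as f:
        results[p] = (enc, f.read())
    print(os.path.basename(p), enc)
```

The "utf-16", "utf-32" and "utf-8-sig" codecs strip the BOM while decoding, so the text comes back without a stray U+FEFF at the start.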
