Python help!

sp3co

Well-known member
  • Sep 22, 2016
    Katubedda
    I need some help from the Python experts..

    Has anyone here done a UCSC Sinhala Corpus + NLTK project? I'm using the UCSC corpus for a project. I tried to open it the same way you open a custom corpus in NLTK (so that I can use the NLTK functions), but I get a Unicode error like this:

    Traceback (most recent call last):
    File "/home/xxxx/PycharmProjects/testing/readcorpus.py", line 13, in <module>
    file = read_file.read()
    File "/home/xxxx/.virtualenvs/PycharmProjects/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


    This is my code. The readpath in it is correct too, but the error occurs when it tries to read that file.

    from nltk.corpus import PlaintextCorpusReader

    corpus_root = './resources/corpus/UCSC-Sinhala-News-Corpus/UCSC-Sinhala-News-Corpus-V1'

    sinhala_corpus = PlaintextCorpusReader(corpus_root, '.*')

    print(sinhala_corpus.fileids())

    readpath = './resources/corpus/UCSC-Sinhala-News-Corpus/UCSC-Sinhala-News-Corpus-V1/News Corpus_V1/NPED0001.TXT'

    read_file = open(readpath, 'r', encoding='utf-8')
    file = read_file.read()



    Why does this happen, and how do I fix it?

    When I open that text file in Notepad, its type is shown as Unicode. But if I save it as utf-8, the error goes away and the file is read. How can I fix this without changing the file type?


    I looked on Stack Overflow for how to read a Unicode file, and it says that doing it this way should work.
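
A note on the traceback above: byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE), which is what Notepad writes when a file is saved as "Unicode". Below is a minimal sketch for checking this, assuming the file starts with a byte order mark. `sniff_bom` and the `BOMS` table are hypothetical names for illustration, not part of NLTK or the standard library.

```python
import codecs

# Known byte order marks, UTF-32 first so it is not mistaken for UTF-16
# (the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark).
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom(path, default='utf-8'):
    """Guess a codec name from the file's first bytes (hypothetical helper)."""
    with open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    # No byte order mark found: fall back to the caller's default guess.
    return default
```

The returned name could then be passed as the `encoding` argument of `open()`. A file without a byte order mark cannot be identified this way, so the default is only a guess.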
     

    CloudX64

    Well-known member
  • Nov 26, 2014
    Winterfell
    [Image attachment: asdasdasdas.png]
     

    owlX

    Well-known member
  • Jul 13, 2014
    /usr/bin
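    Code:
    # -*- encoding: utf-8 -*-

    # Convert a file of unknown encoding to UTF-8

    import codecs
    import subprocess

    file_location = "jumper.sub"
    # Ask the Unix `file` utility to guess the encoding (e.g. "utf-16le")
    file_encoding = subprocess.check_output(
        ['file', '-b', '--mime-encoding', file_location]).decode().strip()

    file_stream = codecs.open(file_location, 'r', file_encoding)
    # The converted copy is written next to the original, with "b" appended to its name
    file_output = codecs.open(file_location + "b", 'w', 'utf-8')

    for l in file_stream:
        file_output.write(l)

    file_stream.close()
    file_output.close()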

    Try this one. I had the same issue a year ago.
     

    owlX

    Well-known member
  • Jul 13, 2014
    /usr/bin
    Code:
    # -*- encoding: utf-8 -*-
    import codecs
    import re

    def get_words(filepath):
        # Read the UTF-16 file and collapse runs of whitespace into single spaces
        with codecs.open(filepath, 'r', 'UTF-16') as f:
            return re.sub(r'\s+', ' ', f.read())

    words_to_string = get_words('sample3.txt')
    words_to_list = words_to_string.split()

    word_dic = {}

    # Collect each pair of adjacent words and count how often it occurs in the text
    for i in range(len(words_to_list) - 1):
        pair = words_to_list[i] + " " + words_to_list[i + 1]
        word_dic[pair] = words_to_string.count(pair)

    # Print each pair with its count and also append it to output.txt
    with codecs.open("output.txt", "a", encoding="utf-8") as myfile:
        for pair, count in word_dic.items():
            myfile.write(pair + " : " + str(count) + "\n")
            print(pair + " : " + str(count))

    :lol: Don't judge the code itself, just see if there's anything in it you can use.
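
A possible alternative for the pair counting above, as a sketch only. `str.count` on the joined text matches substrings, so a pair like "a b" is also counted inside "aa bb". Counting the adjacent pairs directly with `collections.Counter` avoids that. `count_bigrams` is a hypothetical helper name, and the sample string is made up for illustration.

```python
from collections import Counter

def count_bigrams(text):
    """Count each pair of adjacent whitespace-separated words (hypothetical helper)."""
    words = text.split()
    # zip(words, words[1:]) pairs every word with the one that follows it
    return Counter(a + " " + b for a, b in zip(words, words[1:]))

# Made-up sample text: "a b" occurs twice as a word pair
counts = count_bigrams("a b aa bb a b")
for pair, n in counts.items():
    print(pair, ":", n)
```

With this input the pair "a b" gets a count of 2, whereas `"a b aa bb a b".count("a b")` gives 3 because of the match inside "aa bb".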
     

    sp3co

    Well-known member
  • Sep 22, 2016
    Katubedda
    Thanks, mate. I'll take a look.


    Didn't you try defining the encoding right at the top?

    Where exactly do you mean by "right at the top"? The encoding is already defined where the file is opened.
     

    sp3co

    Well-known member
  • Sep 22, 2016
    Katubedda
    Guys, it's working now. "Unicode" means utf-16, right?

    As soon as I set encoding = 'utf-16', it worked.
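
For anyone who lands here with the same error, here is a minimal sketch of the fix described above. The `utf-16` codec uses the byte order mark to pick the byte order and leaves it out of the decoded text. The snippet writes its own sample file so it can run anywhere; the file name `sample_utf16.txt` and the sample text are assumptions, not part of the corpus.

```python
import os
import tempfile

# Write a small UTF-16 file, mimicking what Notepad saves as "Unicode".
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_utf16.txt')
with open(sample_path, 'w', encoding='utf-16') as f:
    f.write('සිංහල පුවත්')

# Reading it as UTF-8 fails in the same way as in the traceback.
try:
    with open(sample_path, 'r', encoding='utf-8') as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading it as UTF-16 works, and the byte order mark is not part of the result.
with open(sample_path, 'r', encoding='utf-16') as f:
    text = f.read()

print('utf-8 ok:', utf8_ok, '| characters read:', len(text))
```

NLTK's `PlaintextCorpusReader` also accepts an `encoding` keyword argument, so passing `encoding='utf-16'` when constructing the reader should let `raw()` and `words()` read the corpus files directly. This has not been checked against the UCSC corpus itself.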