├── README.md └── src ├── alignment.py ├── errorAnalysis.py ├── finalScore.py ├── findTranslationProbability.py ├── languageModelInput.py ├── phraseExtraction.py ├── preprocess.py └── stackDecoding.py /README.md: -------------------------------------------------------------------------------- 1 | Phrase-Based-Translation 2 | ======================== 3 | 4 | This repository contains a project done as part of the course Natural Language Processing - Advanced, Spring 2014. The course was instructed by [Dr. Dipti Misra Sharma](http://www.iiit.ac.in/people/faculty/dipti), [Dr. Ravi Jampani](http://www.cise.ufl.edu/~rjampani/index.html) and [Mr. Akula Arjun Reddy](http://web.iiit.ac.in/~arjunreddy.aug08/) 5 | 6 | 7 | A detailed report is available here 8 | 9 | ## Requirements 10 | * Python 2.6 or above 11 | * GIZA++ 12 | * Language Model (IRSTLM) 13 | 14 | ## Problem 15 | In this project, a phrase-based translation model is implemented. A phrase-based model is a simple model for machine translation that translates phrases rather than individual words, and therefore requires a dictionary that maps phrases from one language to another. We first find the word alignments. Next, using the bi-text corpus, we train the model and calculate the translation probabilities. Along with the translation probabilities, we use a language model to reflect fluency in English. 16 | 17 | 18 | The source folder consists of the following modules: 19 | 20 | ### Main functions 21 | 22 | * preprocess.py 23 | This module takes as input the bi-text corpora and the number of training sentences. It returns the training and testing datasets along with the sentence pairs. 
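The random train/test split performed by preprocess.py can be sketched as follows. This is a minimal, self-contained illustration, not the script itself; the toy sentences and the `split_corpus` helper are made up for the example:

```python
import random

def split_corpus(source_lines, target_lines, n_train, seed=0):
    """Randomly pick n_train sentence pairs for training; the rest become test candidates."""
    random.seed(seed)
    picked = set(random.sample(range(len(source_lines)), n_train))
    pairs = list(zip(source_lines, target_lines))
    train = [p for i, p in enumerate(pairs) if i in picked]
    test = [p for i, p in enumerate(pairs) if i not in picked]
    return train, test[:5]  # keep at most 5 test sentences, as preprocess.py does

src = ["das haus", "das buch", "ein haus", "ein buch", "das auto", "ein auto"]
tgt = ["the house", "the book", "a house", "a book", "the car", "a car"]
train, test = split_corpus(src, tgt, 4)
print(len(train), len(test))
```

Note that the actual script samples line indices with `random.randint`, so duplicate draws can make the training set slightly smaller than requested; the sketch above uses `random.sample` to avoid that.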
24 | 25 | Run the following command to create a random set of x sentences: 26 | 27 | **python preprocess.py sourceCorpus targetCorpus numberOfSentencesForTraining** 28 | 29 | It will generate four files: 30 | trainingSource.txt trainingTarget.txt testingSource.txt testingTarget.txt 31 | trainingSource.txt, trainingTarget.txt: contain the given number of sentences 32 | testingSource.txt, testingTarget.txt: contain 5 test sentences which we use later 33 | 34 | Next, run the word alignment tool GIZA++ to obtain the alignments. 35 | 36 | In order to run GIZA++, do the following: 37 | 38 | **./plain2snt.out trainingSource.txt trainingTarget.txt** 39 | **./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt** 40 | 41 | If the previous step gives an error, then do: 42 | 43 | **./snt2cooc.out trainingSource.vcb trainingTarget.vcb trainingSource_trainingTarget.snt > cooc.cooc** 44 | **./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt -CoocurrenceFile cooc.cooc** 45 | 46 | This will generate several files. The word alignments are present in the A3 file. Repeat this step, swapping trainingSource.txt and trainingTarget.txt, to get the alignment in the other direction. Let sourceAlignment.txt and targetAlignment.txt be the two files. Then we obtain the phrases as follows: 47 | 48 | * phraseExtraction.py 49 | This function reads the two files generated by GIZA++ containing the source-to-target and target-to-source alignments and returns all the possible phrases associated with them. Run the following command to get the phrases: 50 | 51 | **python phraseExtraction.py sourceAlignment.txt targetAlignment.txt** 52 | The phrases are generated in the file phrases.txt. Next we calculate the translation probability. 53 | 54 | * findTranslationProbability.py 55 | After obtaining the consistent phrases from the phrase extraction algorithm, we next move on to finding the translation probability. 
This is done by calculating the relative occurrences of the target phrase for a given source phrase, in both directions. 56 | 57 | Run the following command: 58 | 59 | **python findTranslationProbability.py phrases.txt** 60 | It will generate two files: 61 | translationProbabilitySourceGivenTarget.txt 62 | translationProbabilityTargetGivenSource.txt 63 | 64 | * languageModelInput.py 65 | This helps in formatting the input file for the language model. It removes all special characters. In order to run it, do the following: 66 | 67 | **python languageModelInput.py trainSource.txt trainS.txt** 68 | **python languageModelInput.py trainTarget.txt trainT.txt** 69 | 70 | Compress the output files with gzip (producing trainS.gz and trainT.gz); the compressed files are the input to the language model, which is run as follows: 71 | 72 | **./ngt -i="gunzip -c trainS.gz" -n=3 -o=train.www -b=yes** 73 | **./tlm -tr=train.www -n=3 -lm=wb -o=trainS.lm** 74 | **./ngt -i="gunzip -c trainT.gz" -n=3 -o=train.www -b=yes** 75 | **./tlm -tr=train.www -n=3 -lm=wb -o=trainT.lm** 76 | 77 | * finalScore.py 78 | 79 | After obtaining the translation probability from the alignment matrix, this module combines it with the probability from the language model and returns the final translation probability. 80 | 81 | Run the following command for both directions: 82 | **python finalScore.py translationProbabilityTargetGivenSource.txt trainSource.lm finalTranslationProbabilityTargetGivenSource.txt** 83 | 84 | **python finalScore.py translationProbabilitySourceGivenTarget.txt trainTarget.lm finalTranslationProbabilitySourceGivenTarget.txt** 85 | 86 | It writes the final translation probabilities to the given output files. 87 | 88 | * stackDecoding.py 89 | Once we obtain the final translation probabilities, we find the best phrase translation. This function gives the translation for a given sentence based on hypothesis recombination. 
Run the following command: 90 | 91 | **python stackDecoding.py finalTranslationProbabilityTargetGivenSource.txt testingTarget.txt** 92 | **python stackDecoding.py finalTranslationProbabilitySourceGivenTarget.txt testingSource.txt** 93 | 94 | ### Helper Function 95 | * alignment.py 96 | This is a helper function which generates the word alignment matrix for a pair of sentences. 97 | 98 | ### Error Analysis 99 | The script errorAnalysis.py expects its input in a very specific format: given the source sentence, the translated sentence and the actual translation separated by newlines, it returns the precision and recall for the input file in evaluation.txt -------------------------------------------------------------------------------- /src/alignment.py: -------------------------------------------------------------------------------- 1 | '''given a pair of sentences along with the word alignment this code returns the union of the word alignment matrix''' 2 | '''it serves as a helper function to the phraseExtraction algorithm''' 3 | 4 | from collections import defaultdict 5 | import string 6 | 7 | def findAlignment(german, englishAligned, english, germanAligned): 8 | 9 | wordAlignment = defaultdict(lambda: defaultdict(int)) 10 | wordIndexEnglish = defaultdict(lambda: -1) 11 | wordIndexGerman = defaultdict(lambda: -1) 12 | 13 | german = german.strip().split() 14 | for i in range(len(german)): 15 | german[i] = german[i].translate(string.maketrans("",""), string.punctuation) 16 | 17 | 18 | english = english.strip().split() 19 | for i in range(len(english)): 20 | english[i] = english[i].translate(string.maketrans("",""), string.punctuation) 21 | 22 | 23 | englishAligned = englishAligned.strip().split(" })") 24 | englishAligned = englishAligned[1:] 25 | count = 0 26 | for key in englishAligned: 27 | words = key.split('({') 28 | if len(words)>1 and words[1]!='': 29 | englishWord = words[0].strip() 30 | englishWord = englishWord.translate(string.maketrans("",""), string.punctuation) 31 | 
indices = words[1].split() 32 | for i in indices: 33 | i = int(i) 34 | wordAlignment[count][i-1] = 1 35 | count += 1 36 | 37 | germanAligned = germanAligned.strip().split(" })") 38 | germanAligned = germanAligned[1:] 39 | count = 0 40 | for key in germanAligned: 41 | words = key.split('({') 42 | if len(words)>1 and words[1]!='': 43 | germanWord = words[0].strip() 44 | germanWord = germanWord.translate(string.maketrans("",""), string.punctuation) 45 | indices = words[1].split() 46 | for i in indices: 47 | i= int(i) 48 | wordAlignment[i-1][count] = 1 49 | count +=1 50 | 51 | return wordAlignment, english, german -------------------------------------------------------------------------------- /src/errorAnalysis.py: -------------------------------------------------------------------------------- 1 | '''this method takes as input the translation, actual input and the real output and gives a precision and recall value for each sentence''' 2 | 3 | import sys 4 | 5 | def calculatePrecision(translation, actual): 6 | '''calculate precision''' 7 | count = 0 8 | for key in translation: 9 | if key in actual: 10 | count+=1 11 | return float(count)/len(translation) 12 | 13 | def calculateRecall(translation, actual): 14 | '''calculate recall''' 15 | count = 0 16 | for key in actual: 17 | if key in translation: 18 | count+=1 19 | return float(count)/len(actual) 20 | 21 | 22 | def main(): 23 | if len(sys.argv)!=2: #check arguments 24 | print "Usage :: python errorAnalysis.py translationEnglishToGermanTraining.txt " 25 | sys.exit(0) 26 | 27 | data = [] 28 | data.append(sys.argv[1]) 29 | f=open(sys.argv[1],'r') 30 | for line in f: 31 | translation = f.next().strip().split(' ') 32 | actual = f.next().strip().split(' ') 33 | precision = calculatePrecision(translation, actual) 34 | recall = calculateRecall(translation,actual) 35 | data.append(str(precision)+'\t'+str(recall)) 36 | f.close() 37 | 38 | f=open('evaluation.txt','a') 39 | f.write('\n'.join(data)) 40 | f.write('\n') 41 | 
f.close() 42 | 43 | if __name__ == "__main__": #main 44 | main() -------------------------------------------------------------------------------- /src/finalScore.py: -------------------------------------------------------------------------------- 1 | '''after obtaining the translationProbability from the alignment matrix,it combines the translation probability from the 2 | language model and returns the findTranslationProbability''' 3 | '''it takes as input the translationProbability as obtained from the word alignments and the unigram probability obtained from 4 | the language model for the source language and returns a final score in the file finalTranslationProbability.txt''' 5 | 6 | from collections import defaultdict 7 | import sys 8 | 9 | def calculateProbability(translationFile, languageModelFile, outputFileName): 10 | 11 | '''method to combine the translationProbability obtained from languageModel and the alignment file''' 12 | plm = {} 13 | f=open(languageModelFile,'r') 14 | for line in f: 15 | line = line.strip().split('\t') 16 | if len(line)==2: 17 | plm[line[1]] = float(line[0])*-1 18 | f.close() 19 | 20 | data=[] 21 | f = open(translationFile, 'r') 22 | for line in f: 23 | words = line.strip().split('\t') 24 | sourceWords = words[1].split(' ') 25 | value = float(words[2]) 26 | temp = 1 27 | flag = 0 28 | for key in sourceWords: 29 | if key in plm: 30 | flag = 1 31 | temp *= plm[key] 32 | if flag: 33 | temp = temp * -1 34 | value += temp 35 | data.append(words[0]+'\t'+words[1]+'\t'+str(value)) 36 | f.close() 37 | 38 | f=open(outputFileName ,'w') 39 | f.write('\n'.join(data)) 40 | f.close() 41 | 42 | 43 | def main(): 44 | if len(sys.argv)!=4: #check arguments 45 | print "Usage :: python finalScore.py translationProbabilityTargetGivenSource.txt trainSource.lm finalTranslationProbabilityTargetGivenSource.txt " 46 | sys.exit(0) 47 | 48 | calculateProbability(sys.argv[1], sys.argv[2], sys.argv[3]) 49 | 50 | if __name__ == "__main__": #main 51 | main() 
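finalScore.py combines the two scores in its own particular way (it stores negated log values and flips signs); in standard log-space models the combination is simply the addition of log-probabilities. A toy sketch of that standard additive combination — all numbers here are made up for illustration:

```python
import math

def combine_score(translation_logprob, lm_logprobs):
    """Add the phrase translation log-probability and the per-word LM log-probabilities."""
    return translation_logprob + sum(lm_logprobs)

# toy values: log P(target phrase | source phrase) and unigram LM log-probs per source word
tp = math.log(0.5)
lm = [math.log(0.1), math.log(0.2)]
score = combine_score(tp, lm)
print(round(score, 4))  # equals log(0.5 * 0.1 * 0.2)
```

Adding log-probabilities corresponds to multiplying the underlying probabilities, which is why the translation model and language model scores can be accumulated with plain sums.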
-------------------------------------------------------------------------------- /src/findTranslationProbability.py: -------------------------------------------------------------------------------- 1 | '''after obtaining the consistent phrases from the phrase extraction algorithm we next move to find the translationProbability 2 | this is done by calculating the relative occurrences of the target phrase for a given source phrase for both directions''' 3 | '''it takes as input the phrases.txt file and returns the translationProbability in the files named 4 | translationProbabilityTargetGivenSource.txt and translationProbabilitySourceGivenTarget.txt''' 5 | 6 | from collections import defaultdict 7 | import sys 8 | import math 9 | 10 | countGerman=defaultdict(lambda: defaultdict(int)) 11 | sumCountEnglish=defaultdict(int) 12 | countEnglish = defaultdict(lambda: defaultdict(int)) 13 | sumCountGerman = defaultdict(int) 14 | 15 | 16 | def findTranslationProbability(phrasesFile): 17 | 18 | f = open(phrasesFile, 'r') 19 | for line in f: 20 | phrases = line.strip().split('\t') 21 | if len(phrases) == 2: 22 | countGerman[phrases[0]][phrases[1]]+=1 23 | sumCountEnglish[phrases[0]]+=1 24 | countEnglish[phrases[1]][phrases[0]]+=1 25 | sumCountGerman[phrases[1]]+=1 26 | f.close() 27 | 28 | data=[] 29 | for key in countGerman: 30 | for key1 in countGerman[key]: 31 | translationProbability = math.log(float(countGerman[key][key1])/sumCountEnglish[key]) 32 | data.append(key1+'\t'+key +'\t'+str(translationProbability)) 33 | 34 | f=open('translationProbabilityTargetGivenSource.txt','w') 35 | f.write('\n'.join(data)) 36 | f.close() 37 | 38 | data=[] 39 | for key in countEnglish: 40 | for key1 in countEnglish[key]: 41 | translationProbability = math.log(float(countEnglish[key][key1])/sumCountGerman[key]) 42 | data.append(key1+'\t'+key+'\t'+str(translationProbability)) 43 | 44 | f=open('translationProbabilitySourceGivenTarget.txt','w') 45 | f.write('\n'.join(data)) 46 | f.close() 47 | 48 
| def main(): 49 | if len(sys.argv)!=2: #check arguments 50 | print "Usage :: python findTranslationProbability.py phrases.txt" 51 | sys.exit(0) 52 | 53 | findTranslationProbability(sys.argv[1]) 54 | 55 | if __name__ == "__main__": #main 56 | main() 57 | -------------------------------------------------------------------------------- /src/languageModelInput.py: -------------------------------------------------------------------------------- 1 | '''this function helps in generating the language model input''' 2 | 3 | import string 4 | import sys 5 | 6 | def createInput(inputFile,outputFile): 7 | data=[] 8 | f=open(inputFile,'r') 9 | for line in f: 10 | words = line.strip().split() 11 | for i in range(len(words)): 12 | words[i] = words[i].translate(string.maketrans("",""), string.punctuation) 13 | line=' '.join(words) 14 | data.append(line) 15 | f.close() 16 | 17 | f=open(outputFile,'w') 18 | f.write('\n'.join(data)) 19 | f.close() 20 | 21 | def main(): 22 | if len(sys.argv)!=3: #check arguments 23 | print "Usage :: python languageModelInput.py trainSource.txt trainS.txt " 24 | sys.exit(0) 25 | 26 | createInput(sys.argv[1], sys.argv[2]) 27 | 28 | if __name__ == "__main__": #main 29 | main() -------------------------------------------------------------------------------- /src/phraseExtraction.py: -------------------------------------------------------------------------------- 1 | '''phrase extraction algorithm''' 2 | ''' it takes as input the word alignment of both the languages and returns a file called phrases.txt which contains 3 | all the consistent phrases''' 4 | 5 | import sys #import libraries 6 | from alignment import findAlignment 7 | 8 | 9 | 10 | def checkConsistency(fstart, fend, estart, eend, wordAlignment, english, german): 11 | 12 | '''function to check whether the phrase is consistent or not''' 13 | flag =1 14 | 15 | listEnglish =[] 16 | listGerman = [] 17 | 18 | for i in range(len(english)): 19 | if i >= estart and i<=eend: 20 | listEnglish.append(i) 21 | 22 | for i in
range(len(german)): 23 | if i >= fstart and i <=fend: 24 | listGerman.append(i) 25 | 26 | for e in listEnglish: 27 | for f in range(len(german)): 28 | if wordAlignment[e][f]==1: 29 | if f >= fstart and f <=fend: 30 | continue 31 | else: 32 | flag = 0 33 | 34 | for f in listGerman: 35 | for e in range(len(english)): 36 | if wordAlignment[e][f]==1: 37 | if e >= estart and e <=eend: 38 | continue 39 | else: 40 | flag = 0 41 | 42 | return flag 43 | 44 | 45 | def findPhrase(fstart, fend, estart, eend, english, german): 46 | 47 | '''given the starting and end points, it returns the phrase for both the source and the target language''' 48 | phraseE = [] 49 | #print fstart, fend, estart, eend 50 | for i in range(estart,eend+1): 51 | phraseE.append(english[i]) 52 | 53 | phraseG = [] 54 | for i in range(fstart,fend+1): 55 | phraseG.append(german[i]) 56 | 57 | return [' '.join(phraseE), ' '.join(phraseG)] 58 | 59 | def extract(fstart, fend, estart, eend, wordAlignment, english, german): 60 | '''this method extracts the consistent phrases and returns it to the extractPhrases method''' 61 | if fend == -1: 62 | return 'NULL' 63 | 64 | else: 65 | flag=checkConsistency(fstart, fend, estart, eend, wordAlignment, english, german) 66 | if flag: 67 | return findPhrase(fstart, fend, estart, eend, english, german) 68 | else: 69 | return 'NULL' 70 | 71 | def extractPhrases(englishToGerman, germanToEnglish): 72 | '''this method reads the file for both the source and target language and returns the phrases extracted from the 73 | sentences. 
The phrases are consistent in nature''' 74 | 75 | data=[] 76 | feg = open(englishToGerman, 'r') 77 | fge = open(germanToEnglish,'r') 78 | count = 0 79 | while True: 80 | count+=1 81 | print count 82 | line = feg.readline() 83 | if line == "": 84 | break 85 | englishToGerman1 = feg.readline() 86 | englishToGerman2 = feg.readline() 87 | #print englishToGerman1 88 | line = fge.readline() 89 | germanToEnglish1 = fge.readline() 90 | germanToEnglish2 = fge.readline() 91 | #print germanToEnglish1 92 | 93 | wordAlignment, english, german = findAlignment(englishToGerman1, englishToGerman2, germanToEnglish1, germanToEnglish2) 94 | 95 | lEnglish = len(english) 96 | lGerman = len(german) 97 | 98 | phrases = [] 99 | for estart in range(lEnglish): 100 | for eend in range(estart,(lEnglish)): 101 | fstart = lGerman 102 | fend = -1 103 | for i in wordAlignment: 104 | if i <= eend and i >= estart: 105 | for j in wordAlignment[i]: 106 | fstart = min(j, fstart) 107 | fend = max(j, fend) 108 | if ((eend - estart) <= 20) or ((fend -fstart) <= 20) : 109 | phrases.append([estart, eend, fstart, fend]) 110 | #print phrases 111 | for key in phrases: 112 | estart = key[0] 113 | eend = key[1] 114 | fstart = key [2] 115 | fend = key[3] 116 | phrase = extract (fstart, fend,estart, eend,wordAlignment, english, german) 117 | if phrase!= 'NULL': 118 | #print phrase 119 | data.append(phrase[0]+'\t'+phrase[1]) 120 | feg.close() 121 | fge.close() 122 | 123 | f=open('phrases.txt','w') 124 | f.write('\n'.join(data)) 125 | f.close() 126 | 127 | def main(): 128 | if len(sys.argv)!=3: #check arguments 129 | print "Usage :: python phraseExtraction.py englishToGerman germanToEnglish" 130 | sys.exit(0) 131 | 132 | extractPhrases(sys.argv[1], sys.argv[2]) 133 | 134 | if __name__ == "__main__": #main 135 | main() 136 | -------------------------------------------------------------------------------- /src/preprocess.py: -------------------------------------------------------------------------------- 1 | #This 
module takes as input the bi-text corpora and the number of sentences. 2 | #It returns the training and testing dataset along with the sentence pairs. 3 | 4 | import random #import libraries 5 | import string 6 | import sys 7 | from collections import defaultdict 8 | 9 | def preprocessing(numberOfSentences, sourceFile, targetFile): 10 | 11 | indices=defaultdict(int) 12 | trainingSource=[] 13 | trainingTarget=[] 14 | testingSource=[] 15 | testingTarget=[] 16 | 17 | 18 | for i in range(numberOfSentences): #create random line indices; 1920208 is the assumed number of lines in the corpus, and duplicate draws may reduce the effective sample 19 | indices[random.randint(0,1920208)] = 1 20 | 21 | with open(sourceFile,'r') as fSource: #read from source language corpus 22 | for index,line in enumerate(fSource): 23 | if len(line)>0: 24 | line = line.strip() 25 | if indices[index] ==1: 26 | trainingSource.append(line) #create training and testing for source language 27 | else: 28 | testingSource.append(line) 29 | 30 | with open(targetFile,'r') as fTarget: #read from target language corpus 31 | for index,line in enumerate(fTarget): 32 | if len(line)>0: 33 | line = line.strip() 34 | if indices[index] ==1: 35 | trainingTarget.append(line) #create training and testing for target language 36 | else: 37 | testingTarget.append(line) 38 | 39 | 40 | with open('../Dataset/trainingSource.txt','wb') as f: #write into training file for source data 41 | f.write('\n'.join(trainingSource)) 42 | 43 | with open('../Dataset/trainingTarget.txt','wb') as f: #write into training file for target data 44 | f.write('\n'.join(trainingTarget)) 45 | 46 | testingSource=testingSource[:5] #write into testing file for source data 47 | with open('../Dataset/testingSource.txt','wb') as f: 48 | f.write('\n'.join(testingSource)) 49 | 50 | testingTarget=testingTarget[:5] #write into testing file for target data 51 | with open('../Dataset/testingTarget.txt','wb') as f: 52 | f.write('\n'.join(testingTarget)) 53 | 54 | #return sentencePair 55 | 56 | def main(): 57 | if len(sys.argv)!= 4: #check arguments 58 | print "Usage :: 
python preprocess.py file_source file_target numberOfSentencesForTraining" 59 | sys.exit(0) 60 | 61 | numberOfSentences=int(sys.argv[3]) #initialisation 62 | preprocessing(numberOfSentences, sys.argv[1], sys.argv[2] ) 63 | 64 | 65 | if __name__ == "__main__": #main 66 | main() 67 | -------------------------------------------------------------------------------- /src/stackDecoding.py: -------------------------------------------------------------------------------- 1 | '''this function gives the translation for a given sentence based on hypothesis recombination.''' 2 | '''it takes as input the finalTranslationProbability and the input file and returns the output translation in translation.txt''' 3 | 4 | import sys 5 | from collections import defaultdict 6 | import operator 7 | import string 8 | 9 | def findBestTranslation(finalTranslationProbability, inputFile): 10 | 11 | tp = defaultdict(dict) 12 | f=open(finalTranslationProbability,'r') 13 | for line in f: 14 | line = line.strip().split('\t') 15 | line[0] = line[0].translate(string.maketrans("",""), string.punctuation) 16 | line[1] = line[1].translate(string.maketrans("",""), string.punctuation) 17 | tp[line[0]][line[1]] = float(line[2]) 18 | f.close() 19 | 20 | 21 | 22 | data=[] 23 | f=open(inputFile,'r') 24 | for line in f: 25 | translationScore = defaultdict(int) 26 | translationSentence = defaultdict(list) 27 | words = line.strip().split(' ') 28 | for i in range(len(words)): 29 | words[i] = words[i].translate(string.maketrans("",""), string.punctuation) 30 | count = 1 31 | for i in range(len(words)): 32 | translation = '' 33 | for j in range(len(words)-count+1): 34 | phrase = words[j:j+count] 35 | phrase = ' '.join(phrase) 36 | #print phrase 37 | if phrase in tp: 38 | translationPhrase = max(tp[phrase].iteritems(), key=operator.itemgetter(1))[0] 39 | translationScore[count]+=tp[phrase][translationPhrase] 40 | translation+=translationPhrase+' ' 41 | if translation!='': 42 | 
translationSentence[count].append(translation) 43 | count+=1 44 | if translationScore: #guard: a sentence may contain no phrase present in the table 45 | index = max(translationScore.iteritems(), key=operator.itemgetter(1))[0] 46 | finalTranslation = ' '.join(translationSentence[index]) 47 | else: 48 | finalTranslation = '' 49 | data.append(finalTranslation) 50 | f.close() 51 | 52 | f=open('translation.txt','w') 53 | f.write('\n'.join(data)) 54 | f.close() 55 | 56 | def main(): 57 | if len(sys.argv)!=3: #check arguments 58 | print "Usage :: python stackDecoding.py finalTranslationProbability.txt inputFile.txt " 59 | sys.exit(0) 60 | 61 | findBestTranslation(sys.argv[1], sys.argv[2]) 62 | 63 | if __name__ == "__main__": #main 64 | main() --------------------------------------------------------------------------------
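The core lookup inside stackDecoding.py — for each source phrase found in the table, pick the target phrase with the highest score — can be illustrated with a small self-contained sketch. The phrase table below is a made-up toy example, not output from the actual pipeline:

```python
def best_translation(phrase, table):
    """Return the highest-scoring target phrase for a source phrase, or None if unknown."""
    if phrase not in table:
        return None
    return max(table[phrase].items(), key=lambda kv: kv[1])[0]

# toy phrase table: source phrase -> {target phrase: log score}
table = {
    "das haus": {"the house": -0.2, "the home": -1.5},
    "ist klein": {"is small": -0.4},
}
print(best_translation("das haus", table))  # -> the house
print(best_translation("xyz", table))       # -> None
```

Because the scores are log-probabilities, "highest" means closest to zero; the `None` case corresponds to the guard in findBestTranslation for sentences with no matching phrase.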